Bag of words: only count how often each word appears in each text in the corpus
Tokenization: splitting a document into words/tokens
Vocabulary building: collect vocabulary of all words
Encoding: counting how often each vocabulary word appears in each document
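A minimal sketch of these three steps using scikit-learn's CountVectorizer; the toy corpus is made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()           # tokenization + vocabulary building
X = vectorizer.fit_transform(corpus)     # encoding: word counts per document

print(vectorizer.get_feature_names_out())  # learned vocabulary (get_feature_names in older scikit-learn)
print(X.toarray())                         # one row of counts per document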
L14: TF-IDF
CountVectorizer: performs the tokenization and counting steps
Improve tokenization
Add a lower bound on how many times a word must appear to be kept in the vocabulary
Normalization
Spelling
Stemming
Lemmatization (see the sketch after this list)
Lower/Upper case
Stop words: discarding words that appear too frequently to be informative (e.g. “a”, “the”, “i”)
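A small sketch contrasting stemming and lemmatization with NLTK (assumes the NLTK WordNet data has been downloaded; the example words are arbitrary):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                   # "run"  (rule-based suffix stripping)
print(lemmatizer.lemmatize("better", pos="a"))   # "good" (dictionary/vocabulary based)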
Term frequency–inverse document frequency (TF–IDF): give high weight to any term that appears often in a particular document, but not in many documents
TF-IDF = TF * IDF
Term frequency (TF): the number of times a word appears in a document
Inverse document frequency (IDF): typically the log of (total number of documents / number of documents the term appears in)
TF-IDF gives more weight to words that distinguish documents
Low TF-IDF = word appears in many documents
High TF-IDF = word appears often in a few select documents
n-Grams: overlapping sequences of words that preserve some word order
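A short sketch combining TF-IDF weighting, stop-word removal, and bigrams via scikit-learn's TfidfVectorizer; the corpus and ngram_range are illustrative choices:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# ngram_range=(1, 2) keeps unigrams and bigrams; stop_words drops "a", "the", ...
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())  # high value = term frequent in this document but rare in others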
Latent Dirichlet Allocation (LDA): finds groups of words (topics) that appear frequently together
- Clustering algorithm
- Each document is a “mixture” of a subset of the topics
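A minimal topic-modelling sketch with scikit-learn's LatentDirichletAllocation on bag-of-words counts; the corpus and the number of topics are arbitrary placeholders:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the cat chased the mouse",
    "stocks fell as markets closed",
    "the dog chased the cat",
    "investors bought shares and bonds",
]

counts = CountVectorizer().fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)  # 2 topics, arbitrary
doc_topics = lda.fit_transform(counts)  # each row: the document's mixture over topics
print(doc_topics)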
L15: Naive Bayes
Two types of classifiers
Discriminative: learn p(y|X), learn output given input
Generative: learn p(y) and p(X|y), can generate input given a class
Discriminative learns the decision boundary
Generative: learns the distribution of the input data
Naive Bayes
Generative classifier
Uses the class prior p(y) and the likelihood p(X|y) to predict the class given a data point, p(y|X)
Bayes' theorem is used to flip the conditional probability
Different Naive Bayes classifiers exist depending on the assumed distribution of p(X|y) (see the sketch after the pros and cons below)
Pros
Works well when the distributional assumptions match the data
Good for well-separated data, high dimensional data
Fast and easily interpretable
Cons
Performs poorly when many feature values are zero
Correlation between attributes cannot be captured
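A minimal sketch assuming Gaussian-distributed features; other variants (e.g. MultinomialNB for counts, BernoulliNB for binary features) only change the assumed p(X|y). The iris dataset is a placeholder choice:

# Bayes' theorem: p(y|X) = p(X|y) * p(y) / p(X); the "naive" part is assuming
# features are conditionally independent given the class.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB()             # assumes p(X|y) is Gaussian per feature
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))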
L16: Decision Trees
Decision tree (DT): a tree where each node is a decision
Performs both classification and regression
Questions are used to make a decision at each node
Splits are chosen to minimize impurity
Impurity measures: Gini index and cross-entropy
DTs learn the decision boundary by recursively partitioning the space in a manner that maximizes the information gain
Reducing overfitting
Limit maximum depth
Limit number of leaf nodes
Set a minimum number of samples a node must contain before it can be split
Regression tree: DTs but for regression
Impurity: mean square error or mean absolute error
Prediction is the mean of the target values in the leaf
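A short sketch of a depth-limited decision tree classifier; the dataset and hyperparameter values are illustrative, not from the notes:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth and min_samples_split limit tree growth to reduce overfitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_split=5)
tree.fit(X, y)
print(tree.score(X, y))  # training accuracy; use a held-out set in practice

# For regression, DecisionTreeRegressor with criterion="squared_error" or
# "absolute_error" plays the same role.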
Ensembles
Ensembles: combining different models to achieve better performance (Frankenstein models)
Voting classifier: uses votes from each model to classify data
Hard vote: uses discrete votes
Soft vote: uses probabilities
Averaging methods
Build several estimators on random subsets of the original dataset and aggregate their predictions
Bagging: random subsets of the samples with replacement
Bootstrap Aggregation (Bagging): generic way to build different models
Random forests: a collection of decision trees trained on different subsets of the samples and features
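A sketch of two of the ensembles above: a soft-voting classifier over two different models, and a bagged random forest; the base models, dataset, and hyperparameters are placeholder choices:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Soft voting: average the predicted probabilities of the component models
voter = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)), ("nb", GaussianNB())],
    voting="soft",
)
voter.fit(X, y)

# Bagging of trees on random subsets of samples and features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(voter.score(X, y), forest.score(X, y))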
Boosting methods
Boosting: refers to a family of algorithms that convert weak learners into strong learners by training models sequentially, each one correcting the errors of the previous
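A minimal boosting sketch with scikit-learn's GradientBoostingClassifier, where shallow trees act as the weak learners; the dataset and parameter values are placeholders:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

# max_depth=1 makes each tree a weak learner (a decision stump)
boost = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=1)
boost.fit(X, y)
print(boost.score(X, y))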
L17: Model Evaluation
Cross validation
The data is split repeatedly and multiple models are trained
k-fold cross-validation: split the data into k partitions and train k times, each time holding out a different partition for evaluation
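A short k-fold cross-validation sketch; k=5 and the model/dataset are arbitrary choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
print(scores, scores.mean())  # one score per fold, plus the average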
Grid search
Trying all possible combinations of the parameters of interest
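A grid search sketch using scikit-learn's GridSearchCV, which evaluates every parameter combination with cross-validation; the parameter grid and model are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 3, 5], "min_samples_split": [2, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)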