September 01, 2019 ⋅ 5 min read ⋅ Courses

- No notes because it was review or uninteresting

- k-Nearest Neighbors
- Classifies a point based on the labels of its “k” nearest neighbors (see the sketch after this list)
- k is usually odd to prevent ties in binary classification
- Higher k = neighbors are gathered from a larger distance, giving a smoother decision boundary
- Pros
- Easy to understand
- Reasonable performance given simplicity

- Cons
- Poor performance on sparse datasets
- Can be slow on large datasets
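
A minimal sketch of k-NN classification, assuming scikit-learn (dataset and the choice of k are illustrative, not from the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Odd k avoids ties when voting between two classes
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # fraction of correctly classified test points
```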

- Generalization
- Underfit: model is too simple, so predictions miss the structure of the data
- Overfit: model is modeled too closely on the training data and does not generalize

- Classification vs Regression
- Classification: discrete (type/class/label)
- Regression: continuous (target/response/value)

- k-Neighbors Regression
- Average the k nearest neighbors to the input to predict its target

- Evaluating regression
- R^2 (Coefficient of determination)
- R^2 = Explained variance / Total variance
- Typically between 0 and 1 (can be negative for a model that is worse than predicting the mean)
- 1 = Perfect prediction
- 0 = Model only predicts the mean of the data
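
A minimal sketch of k-neighbors regression and R^2 scoring, assuming scikit-learn (synthetic data, illustrative parameters):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=200, n_features=1, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prediction is the average target of the k nearest training points
reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
print(r2_score(y_test, reg.predict(X_test)))  # R^2: 1 = perfect, 0 = predicting the mean
```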

- Linear Regression Models
- Linear regression: fits line to data
- Minimizes the mean square error
- Pros
- Simple

- Cons
- Prone to underfitting
- No way to control model complexity because there are no hyperparameters to tune


- Ridge regression: linear regression but with weight penalty
- Regularization
- Regularization: explicitly restricting a model to prevent overfitting
- Alpha parameter: controls the tradeoff between simplicity and performance of a model
- Tries to make the weights close to zero
- Each feature should have as little effect on the output as possible while still predicting well
- Regularization becomes less important as we have more data
- L1 Regularization (Lasso): penalizes the absolute size of the weights (can drive some weights exactly to zero)
- L2 Regularization (Ridge): penalizes the squared size of the weights
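
A minimal sketch comparing plain least squares with Ridge (L2) and Lasso (L1), assuming scikit-learn; the alpha values are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=50, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(),   # no regularization
              Ridge(alpha=1.0),     # L2: penalizes squared weight size
              Lasso(alpha=1.0)):    # L1: penalizes absolute weight size
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))  # test R^2
```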

- Linear Classifiers
- Classification vs regression = decision boundary vs prediction values
- Logistic regression: predicts the probability of the input in a class (between 0 and 1)

- Linear Support Vector Machines (LinearSVC): the decision boundary maximizes the margin; the support vectors are the training points closest to the boundary
- One-vs-rest: binary relevance, build a binary classifier for each class
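
A minimal sketch of the two linear classifiers above, assuming scikit-learn (synthetic data, illustrative settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
svc = LinearSVC(max_iter=10000).fit(X_train, y_train)

print(logreg.predict_proba(X_test[:3]))  # class probabilities between 0 and 1
print(svc.predict(X_test[:3]))           # hard labels from the margin-based boundary
print(logreg.score(X_test, y_test), svc.score(X_test, y_test))
```
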
- Clustering
- Clustering: grouping data into partitions
- Points in a cluster should be similar; points in different clusters should be different

- k-Means Clustering
- Finds cluster centers that represent the data
- Steps (repeated until the assignments stop changing; see the sketch after this list)

- Assign each point to the closest cluster center
- Update each cluster center to the mean of the points assigned to it

- Pros
- Fast
- Easy to understand and implement

- Cons
- Susceptible to falling into local minima
- We don’t know how many cluster centers to give it
- Assumes all clusters have the same diameter
- Assumes all clusters are isotropic (roughly the same in every direction)
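
A minimal k-means sketch, assuming scikit-learn (blob data and k=3 are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means alternates between assigning points to the nearest center
# and moving each center to the mean of its assigned points
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # learned cluster centers
print(km.labels_[:10])      # cluster assignment for each point
```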

- Agglomerative Clustering
- Assigns each point as its own cluster and merges clusters until the number of desired clusters is reached
- Merging function
- Ward: merge the two clusters that give the smallest increase in within-cluster variance
- Average: merge the two clusters with the smallest average distance between their points
- Complete: merge the two clusters with the smallest maximum distance between their points
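
A minimal agglomerative clustering sketch, assuming scikit-learn; the linkage choice mirrors the merging functions above:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# linkage controls the merging criterion: "ward", "average", or "complete"
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])
```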

- Density-Based Spatial Clustering of Applications with Noise (DBSCAN): identifies points in crowded/dense regions of the feature space
- Core sample: points in a dense region
- If there are at least min_samples many data points within a distance of eps to a given data point, then that data point is classified as a core sample
- Algorithm

- Pick random sample
- Check if there are min number of samples within distance of eps
- If no, then label sample as noise
- If yes, then label sample as core sample and assigned a new cluster label
- All samples within the distance of the core sample are also given the cluster label
- If the samples are core samples, all neighbors are visited recursively

- Three kinds of points

- Noise
- Core samples
- Boundary points: samples within eps distance of core samples

- Increasing eps means more points in a cluster but fewer clusters
- Increasing min samples means more noise points but fewer core points
- Eps implicitly controls the number of clusters
- Min samples controls the minimum cluster size
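
A minimal DBSCAN sketch, assuming scikit-learn (the two-moons data, eps, and min_samples are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: points needed to form a core sample
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)
print(set(labels))  # cluster labels; -1 marks points labeled as noise
```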

- Evaluating Clustering Algorithms
- Without ground truth
- Can’t use accuracy
- Silhouette score: computes how similar a point is to its own cluster compared to other clusters

- With ground truth
- Adjusted rand index (ARI): measures the agreement between the predicted cluster assignments and the ground-truth labels, adjusted for chance
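
A minimal sketch of both evaluation styles, assuming scikit-learn (k-means on blob data is illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))          # no ground truth needed
print(adjusted_rand_score(y_true, labels))  # compares against ground-truth labels
```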


- Scaling
- Standard: mean = 0 and variance = 1
- Robust: median = 0 and interquartile range = 1 (less sensitive to outliers)
- MinMax: values between 0 and 1

- Normalizer: scales each sample to unit norm (projects the data onto a circle/sphere)
- Fit the scaler on the training data, apply it to both the training and testing data
- Never fit the scaler on the testing data, because that produces an improper scaling and leaks information from the test set into training (see the sketch below)
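
A minimal sketch of the fit-on-train, transform-both pattern, assuming scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)                       # fit on the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the same statistics for the test data
```
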
- Dimensionality Reduction
- Reduce the data dimension by compressing it
- Principal Component Analysis (PCA): decomposes the dataset into directions that explain the maximum amount of variance
- Eigenvector with the largest eigenvalue is the first principal component
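
A minimal PCA sketch, assuming scikit-learn (the iris data and two components are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Keep the two directions that explain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # 4 features compressed to 2
print(pca.explained_variance_ratio_)  # variance explained by each principal component
```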

- One-hot encoding: associate a unique integer index with each unique word and represent a word as a vector that is 1 at that index and 0 elsewhere
- Corpus = Dataset
- Information Retrieval (IR)
- Natural Language Processing (NLP)
- Text Data as a Bag of Words
- Discard document structure (paragraphs, chapters, sentences)
- Only count how often each word appears in each text in the corpus

- Tokenization: splitting a document into words/tokens
- Vocabulary building: collect vocabulary of all words
- Encoding: counting how often a word appears

- Count vectorizer = tokenization + vocabulary building + encoding
- Improve tokenization
- Add a lower bound on how many times a word must appear to be kept

- Normalization
- Spelling
- Stemming
- Lemmatization
- Lower/Upper case
- Stop words: discard words that appear too frequently to be informative (e.g. “a”, “the”, “i”)
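
A minimal bag-of-words sketch, assuming scikit-learn's CountVectorizer (the tiny corpus and options are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The quick brown fox", "The lazy dog", "A quick brown dog"]

# Lowercasing is on by default; min_df and stop_words trim the vocabulary
vect = CountVectorizer(min_df=1, stop_words="english")
bag = vect.fit_transform(corpus)     # tokenize, build vocabulary, count
print(vect.get_feature_names_out())  # the learned vocabulary
print(bag.toarray())                 # word counts per document
```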

- Term frequency–inverse document frequency (TF–IDF): give high weight to any term that appears often in a particular document, but not in many documents
- TF-IDF = TF * IDF
- Term frequency (TF): the number of times a word appears in a document
- Inverse document frequency (IDF): total documents / number of documents the term appears in (usually with a logarithm applied)
- TF-IDF gives more weight to words that distinguish documents
- Low TF-IDF = word appears in many documents
- High TF-IDF = word appears often but only in a few documents
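
A minimal TF-IDF sketch, assuming scikit-learn (the corpus is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The quick brown fox", "The lazy dog", "A quick brown dog"]

# Words common to many documents get low weight;
# words concentrated in a few documents get high weight
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))
```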

- n-Grams: overlapping sequence of words that preserves some order
- Latent Dirichlet Allocation (LDA): finds groups of words (topics) that appear frequently together
- Clustering algorithm
- Each document is a “mixture” of a subset of the topics
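
A minimal sketch combining n-grams and LDA, assuming scikit-learn (corpus, ngram_range, and the number of topics are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ["cats purr and sleep", "dogs bark and run", "cats and dogs play"]

# ngram_range=(1, 2) keeps single words plus overlapping word pairs
vect = CountVectorizer(ngram_range=(1, 2))
counts = vect.fit_transform(corpus)

# Each document becomes a mixture over n_components topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics.round(2))  # per-document topic proportions
```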

- Two types of classifiers
- Discriminative: learn `p(y|X)`, i.e. learn the output given the input
- Generative: learn `p(y)` and `p(X|y)`, i.e. can create inputs given a class
- Discriminative: learns the decision boundary
- Generative: learns the distribution of the input data

- Naive Bayes
- Generative classifier
- Uses the class probability `p(y)` and the class-conditional probability `p(X|y)` to predict the class given a data point
- Bayes’ theorem is used to flip the conditional probability
- Different Naive Bayes classifiers exist depending on the assumed distribution of `p(X|y)` (e.g. Gaussian, Bernoulli, multinomial)
- Pros
- Performs well when the assumptions match the data
- Good for well-separated data, high dimensional data
- Fast and easily interpretable

- Cons
- Performs poorly when many feature values are zero
- Correlations between attributes cannot be captured
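
A minimal Naive Bayes sketch, assuming scikit-learn; GaussianNB is just one choice of assumed `p(X|y)`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB assumes p(X|y) is Gaussian per feature;
# MultinomialNB / BernoulliNB assume count / binary features instead
nb = GaussianNB().fit(X_train, y_train)
print(nb.predict_proba(X_test[:3]))  # p(y|X) obtained via Bayes' theorem
print(nb.score(X_test, y_test))
```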

- Decision tree (DT): a tree where each node is a decision
- Performs both classification and regression
- Questions are used to make a decision at each node
- Minimize impurity
- Impurity: Gini-index and cross-entropy

- DTs learn the decision boundary by recursively partitioning the space in a manner that maximizes the information gain
- Reducing overfitting
- Limit maximum depth
- Limit number of leaf nodes
- Set the minimum number of points a node must contain before it can be split

- Regression tree: DTs but for regression
- Impurity: mean square error or mean absolute error
- Prediction is the mean of the training points in the leaf
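
A minimal decision tree sketch, assuming scikit-learn; the depth, leaf, and split limits show the overfitting controls listed above (values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limiting depth, leaf count, and split size reduces overfitting
tree = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8,
                              min_samples_split=4, criterion="gini",
                              random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_train, y_train), tree.score(X_test, y_test))
```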

- Ensembles
- Ensembles: combining different models to achieve better performance (Frankenstein models)
- Voting classifier: uses votes from each model to classify data
- Hard vote: uses discrete votes
- Soft vote: uses probabilities

- Averaging methods
- Build several models from random subsets of the original dataset and aggregate their predictions
- Bagging: random subsets of the samples with replacement
- Bootstrap Aggregation (Bagging): generic way to build different models
- Random forests: a collection of decision trees trained on different subsets of samples and features

- Boosting methods
- Boosting: refers to a family of algorithms that build models sequentially to convert weak learners into a strong learner
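
A minimal ensemble sketch covering voting, averaging (random forest), and boosting, assuming scikit-learn (data and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (VotingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Voting: combine different models; soft voting averages their probabilities
vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(max_depth=3))],
    voting="soft")

# Averaging over many trees vs. sequential boosting
forest = RandomForestClassifier(n_estimators=100, random_state=0)
boost = GradientBoostingClassifier(random_state=0)

for model in (vote, forest, boost):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```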

- Model evaluation
- Cross validation
- The data is split repeatedly and multiple models are trained
- k-fold cross-validation: split data into k partitions and train k times

- Grid search
- Trying all possible combinations of the parameters of interest
- Each parameter combination is typically scored with cross-validation
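
A minimal sketch of cross-validation and grid search, assuming scikit-learn; the SVC model and parameter grid are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: split the data 5 ways and train 5 models
print(cross_val_score(SVC(), X, y, cv=5))

# Grid search tries every combination of the listed parameters,
# scoring each one with cross-validation
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```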

- Evaluation metrics
- Accuracy: (TP + TN) / Total. Fraction of correctly classified examples
- Business metrics
- False positives/negatives
- Confusion matrices
- Precision: TP / (TP + FP). Used to limit false positives
- Recall: TP / (TP + FN). Used to limit false negatives
- f-measure (F1): harmonic mean of precision and recall
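
A minimal sketch of the classification metrics above, assuming scikit-learn (the imbalanced synthetic dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)

print(confusion_matrix(y_test, pred))  # rows: true class, columns: predicted class
print(accuracy_score(y_test, pred))    # (TP + TN) / total
print(precision_score(y_test, pred))   # TP / (TP + FP): limits false positives
print(recall_score(y_test, pred))      # TP / (TP + FN): limits false negatives
print(f1_score(y_test, pred))          # harmonic mean of precision and recall
```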

- Feature engineering
- Use expert knowledge
- Domain expertise: specific knowledge in a domain
- Data representation
- One-hot encoding: turn a categorical feature into a vector where a 1 marks the category that is present
- Binning: turn continuous data into discrete data by categorizing it
- Interactions: combine features, e.g. add products of the binned indicators and the original feature
- Polynomials: add powers of the original features (polynomial features)
- Transformations: modify the data using some math function (e.g. log)
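
A minimal sketch of these feature-engineering representations, assuming scikit-learn and NumPy (tiny arrays and bin counts are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer, PolynomialFeatures

# One-hot encode a categorical feature
colors = np.array([["red"], ["green"], ["red"]])
print(OneHotEncoder().fit_transform(colors).toarray())

# Bin a continuous feature into discrete categories
x = np.array([[0.1], [0.4], [0.7], [0.9]])
print(KBinsDiscretizer(n_bins=3, encode="onehot-dense",
                       strategy="uniform").fit_transform(x))

# Add polynomial features (x, x^2, x^3)
print(PolynomialFeatures(degree=3, include_bias=False).fit_transform(x))

# Transformation with a math function, e.g. log
print(np.log1p(x))
```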