Bag of words: only count how often each word appears in each text in the corpus
Tokenization: splitting a document into words/tokens
Vocabulary building: collect vocabulary of all words
Encoding: counting how often each vocabulary word appears in each document
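A minimal sketch of these three steps using scikit-learn's CountVectorizer; the toy corpus is made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()           # tokenization + vocabulary building
X = vectorizer.fit_transform(corpus)     # encoding: word counts per document

print(vectorizer.get_feature_names_out())  # learned vocabulary (get_feature_names in older scikit-learn)
print(X.toarray())                         # one row of counts per document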
L14: TF-IDF
CountVectorizer: performs the tokenization and counting steps
Improve tokenization
Add a lower bound on how many times a word must appear to be kept in the vocabulary
Normalization
Spelling
Stemming
Lemmatization (see the sketch after this list)
Lower/Upper case
Stop words: discarding words that appear too frequently to be informative (e.g. “a”, “the”, “i”)
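A small sketch contrasting stemming and lemmatization with NLTK (assumes the NLTK WordNet data has been downloaded; the example words are arbitrary):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                   # "run"  (rule-based suffix stripping)
print(lemmatizer.lemmatize("better", pos="a"))   # "good" (dictionary/vocabulary based)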
Term frequency–inverse document frequency (TF–IDF): give high weight to any term that appears often in a particular document, but not in many documents
TF-IDF = TF * IDF
Term frequency (TF): the number of times a word appears in a document
Inverse document frequency (IDF): typically the log of (total number of documents / number of documents the term appears in)
TF-IDF gives more weight to words that distinguish documents
Low TF-IDF = word appears in many documents
High TF-IDF = word appears often in a few select documents
n-Grams: overlapping sequences of words that preserve some word order
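A short sketch combining TF-IDF weighting, stop-word removal, and bigrams via scikit-learn's TfidfVectorizer; the corpus and ngram_range are illustrative choices:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# ngram_range=(1, 2) keeps unigrams and bigrams; stop_words drops "a", "the", ...
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())  # high value = term frequent in this document but rare in others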
Latent Dirichlet Allocation (LDA): finds groups of words (topics) that appear frequently together
- Clustering algorithm
- Each document is a “mixture” of a subset of the topics
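A minimal topic-modelling sketch with scikit-learn's LatentDirichletAllocation on bag-of-words counts; the corpus and the number of topics are arbitrary placeholders:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the cat chased the mouse",
    "stocks fell as markets closed",
    "the dog chased the cat",
    "investors bought shares and bonds",
]

counts = CountVectorizer().fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)  # 2 topics, arbitrary
doc_topics = lda.fit_transform(counts)  # each row: the document's mixture over topics
print(doc_topics)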
L15: Naive Bayes
Two types of classifiers
Discriminative: learn p(y|X), learn output given input
Generative: learn p(y) and p(X|y), can generate input given a class
Discriminative learns the decision boundary
Generative: learns the distribution of the input data
Naive Bayes
Generative classifier
Uses the class prior p(y) and the likelihood p(X|y) to predict the class given a data point, p(y|X)
Bayes' theorem is used to flip the conditional probability
Different Naive Bayes classifiers exist depending on the assumed distribution of p(X|y) (see the sketch after the pros and cons below)
Pros
Works well when the distributional assumptions match the data
Good for well-separated data, high dimensional data
Fast and easily interpretable
Cons
Performs poorly when many feature values are zero
Correlation between attributes cannot be captured
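A minimal sketch assuming Gaussian-distributed features; other variants (e.g. MultinomialNB for counts, BernoulliNB for binary features) only change the assumed p(X|y). The iris dataset is a placeholder choice:

# Bayes' theorem: p(y|X) = p(X|y) * p(y) / p(X); the "naive" part is assuming
# features are conditionally independent given the class.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB()             # assumes p(X|y) is Gaussian per feature
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))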
L16: Decision Trees
Decision tree (DT): a tree where each node is a decision
Performs both classification and regression
Questions are used to make a decision at each node
Splits are chosen to minimize impurity
Impurity measures: Gini index and cross-entropy
DTs learn the decision boundary by recursively partitioning the space in a manner that maximizes the information gain
Reducing overfitting
Limit maximum depth
Limit number of leaf nodes
Set a minimum number of samples a node must contain before it can be split
Regression tree: DTs but for regression
Impurity: mean square error or mean absolute error
Prediction is the mean of the target values in the leaf
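A short sketch of a depth-limited decision tree classifier; the dataset and hyperparameter values are illustrative, not from the notes:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth and min_samples_split limit tree growth to reduce overfitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_split=5)
tree.fit(X, y)
print(tree.score(X, y))  # training accuracy; use a held-out set in practice

# For regression, DecisionTreeRegressor with criterion="squared_error" or
# "absolute_error" plays the same role.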
Ensembles
Ensembles: combining different models to achieve better performance (Frankenstein models)
Voting classifier: uses votes from each model to classify data
Hard vote: uses discrete votes
Soft vote: uses probabilities
Averaging methods
Build several estimators on random subsets of the original dataset and aggregate their predictions
Bagging: random subsets of the samples with replacement
Bootstrap Aggregation (Bagging): generic way to build different models
Random forests: a collection of decision trees trained on different subsets of the samples and features
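A sketch of two of the ensembles above: a soft-voting classifier over two different models, and a bagged random forest; the base models, dataset, and hyperparameters are placeholder choices:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Soft voting: average the predicted probabilities of the component models
voter = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)), ("nb", GaussianNB())],
    voting="soft",
)
voter.fit(X, y)

# Bagging of trees on random subsets of samples and features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(voter.score(X, y), forest.score(X, y))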
Boosting methods
Boosting: refers to a family of algorithms that convert weak learners into strong learners by training models sequentially, each one correcting the errors of the previous
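A minimal boosting sketch with scikit-learn's GradientBoostingClassifier, where shallow trees act as the weak learners; the dataset and parameter values are placeholders:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

# max_depth=1 makes each tree a weak learner (a decision stump)
boost = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=1)
boost.fit(X, y)
print(boost.score(X, y))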
L17: Model Evaluation
Cross validation
The data is split repeatedly and multiple models are trained
k-fold cross-validation: split the data into k partitions and train k times, each time holding out a different partition for evaluation
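A short k-fold cross-validation sketch; k=5 and the model/dataset are arbitrary choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
print(scores, scores.mean())  # one score per fold, plus the average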
Grid search
Trying all possible combinations of the parameters of interest
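A grid search sketch using scikit-learn's GridSearchCV, which evaluates every parameter combination with cross-validation; the parameter grid and model are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 3, 5], "min_samples_split": [2, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)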