# Supervised learning

Last updated Sep 9, 2022

• Input: features of examples and their corresponding labels
• Output: a model that accurately predicts the labels of new examples

Supervised learning is generally the most successful machine learning technique (with the exception of games)

Examples:

|                   | Decision trees                                        | Naive Bayes                                       |
| ----------------- | ----------------------------------------------------- | ------------------------------------------------- |
| # Features used   | Sequences of rules based on 1 feature                 | All features                                      |
| Training          | $O(dn)$, one pass per depth                           | $O(n)$, just counting                             |
| New data          | May need to recompute tree                            | Just update counts                                |
| Accuracy          | Good if simple rules based on individual features work | Good if features almost independent given label  |
| Interpretability  | Easy to see how decisions are made                    | Easy to see how each feature influences decision  |
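To make the "Training" and "New data" rows concrete, here is a minimal sketch of naive Bayes training as pure counting, on assumed toy data with binary features:

```python
import numpy as np

# Toy binary data: n = 6 examples, d = 2 features (made up for illustration)
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 1]])
y = np.array([1, 1, 0, 0, 1, 0])

# Training is O(n): one pass accumulating feature counts per class
counts = {c: X[y == c].sum(axis=0) for c in np.unique(y)}
n_per_class = {c: (y == c).sum() for c in np.unique(y)}

# Handling new data is just updating counts -- no refitting needed
x_new, y_new = np.array([1, 0]), 1
counts[y_new] = counts[y_new] + x_new
n_per_class[y_new] += 1
```

A decision tree, by contrast, may need to be rebuilt from scratch when new examples arrive, since an early split can change.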

## Notation

• $X$ is the input
• $y$ are the class labels
• $n$ is the number of examples (the example index is generally denoted by a subscript)
• $d$ is the number of features (the feature index is generally denoted by a superscript)
• $\hat y$ are the predictions
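In code, this notation typically maps to an $n \times d$ matrix and a length-$n$ label vector (a hypothetical sketch):

```python
import numpy as np

n, d = 5, 3                       # number of examples, number of features
X = np.random.rand(n, d)          # input: one row per example
y = np.random.randint(0, 2, n)    # class labels, one per example

x_i = X[1]        # a single example (subscript: index over examples)
x_j = X[:, 2]     # a single feature column (superscript: index over features)
```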

## Parametric vs Non-parametric

• Parametric models: have fixed number of parameters w.r.t. $n$
• Non-parametric models: number of parameters grows with $n$
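A sketch of the distinction, using plain NumPy: least-squares linear regression keeps only $d$ weights no matter how large $n$ gets, while 1-nearest-neighbour must store the entire training set.

```python
import numpy as np

n, d = 100, 3
X, y = np.random.rand(n, d), np.random.rand(n)

# Parametric: linear regression stores d weights, independent of n
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Non-parametric: 1-nearest-neighbour's "parameters" are all n examples
model = (X, y)
x_test = np.random.rand(d)
nearest = np.argmin(np.linalg.norm(X - x_test, axis=1))
prediction = y[nearest]
```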

## General Rules

• We care far more about testing error than training error
• Golden Rule: the test data cannot influence training the model in any way
• Independent and Identically Distributed (IID) assumption
• Fundamental trade-off between getting low training error and having training error approximate test error
  • $E_{test} = (E_{test} - E_{train}) + E_{train}$, where $E_{approx} = E_{test} - E_{train}$ is the approximation error, i.e. the amount of overfitting
  • As the model gets more complicated, $E_{train}$ goes down but $E_{approx}$ goes up
  • We can mitigate this by penalizing model complexity (e.g. adding a regularization term in linear regression)
• Optimization bias
  • How biased is an “error” that we optimized over many possibilities?
  • Large if you compare lots of different models, small if you only compare a few
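The trade-off can be seen directly with polynomial regression on toy data (an illustrative sketch; the data and degrees are made up): raising the degree always drives $E_{train}$ down, while the gap $E_{approx} = E_{test} - E_{train}$ tends to grow.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)        # noisy training data
x_test = np.linspace(0, 1, 30) + 1 / 60                       # fresh draw from the same range
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x.size)

e_train, e_approx = {}, {}
for degree in [1, 3, 10]:
    w = np.polyfit(x, y, degree)
    e_train[degree] = np.mean((np.polyval(w, x) - y) ** 2)
    e_test = np.mean((np.polyval(w, x_test) - y_test) ** 2)
    e_approx[degree] = e_test - e_train[degree]               # the decomposition above
```

Because the polynomial families are nested, training error can only decrease as the degree rises; nothing forces the test error to follow.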

## Decision Theory

Are we equally concerned about each potential outcome? Usually not! Sometimes false negatives or false positives have outsized impact; the cost of a mistake can differ by outcome.

• Letting a spam message through (a false negative) is not a big deal.
• Filtering a legitimate message (a false positive) will make users mad.

We can look to decision theory to help us here. Denote $cost(\hat y_i, \tilde y_i)$ as the cost of predicting $\hat y_i$ instead of the actual label $\tilde y_i$.

Then, instead of predicting the most probable label, compute the expected cost of each possible action and take the action with the lowest expected cost: $\mathbb E [cost(\hat y_i, \tilde y_i)]$

• We assign the probability of each state to be the confidence of the predicted class (e.g. predicting ‘spam’ with 0.6 likelihood means we set the probability of true ‘spam’ to 0.6)
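A minimal sketch of this rule on the spam example (the cost values are illustrative assumptions): even when "spam" is the most probable label, the expected-cost rule can choose not to filter.

```python
import numpy as np

# Hypothetical cost matrix: cost[action, true_label], labels 0 = not spam, 1 = spam.
# Filtering real mail (false positive) is assumed far costlier than
# letting spam through (false negative).
cost = np.array([[0.0, 1.0],     # action 0 "keep":   small cost if it was spam
                 [10.0, 0.0]])   # action 1 "filter": large cost if it was not spam

p_spam = 0.6                      # classifier's confidence that the message is spam
p = np.array([1 - p_spam, p_spam])

expected_cost = cost @ p          # expected cost of each action
action = int(np.argmin(expected_cost))
```

Here the expected costs are $0.6$ for "keep" and $4.0$ for "filter", so the message is kept despite spam being the more probable label.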