- Input: features of examples and their corresponding labels
- Output: a model that can accurately predict the labels of new examples
Supervised learning is generally the most successful machine learning technique (with the exception of games)
Examples:
- Decision trees
- Naive Bayes
- KNN (with an appropriate k)
- Ensemble Methods
- Linear regression
Tradeoffs:
| | Decision trees | Naive Bayes |
|---|---|---|
| # features used | Sequences of rules, each based on 1 feature | All features |
| Training | Slower, one pass over the data per depth | Fast, just counting |
| New data | May need to recompute tree | Just update counts |
| Accuracy | Good if simple rules based on individual features work | Good if features almost independent given label |
| Interpretability | Easy to see how decisions are made | Easy to see how each feature influences the decision |
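The "New data: just update counts" row of the table can be made concrete with a minimal pure-Python sketch (the class and data below are illustrative, not from the notes): because naive Bayes training is just counting, incorporating one new example only increments counters.

```python
# Toy categorical naive Bayes: training/updating is "just counting".
from collections import defaultdict

class TinyNaiveBayes:
    def __init__(self):
        self.label_counts = defaultdict(int)    # count of each label
        self.feature_counts = defaultdict(int)  # count of (label, feature index, value)

    def update(self, x, label):
        # Incorporating one new example is O(d): just bump counters.
        self.label_counts[label] += 1
        for j, v in enumerate(x):
            self.feature_counts[(label, j, v)] += 1

    def predict(self, x):
        # Score each label by p(label) * prod_j p(x^j | label), estimated
        # from raw counts with add-one (Laplace) smoothing for binary features.
        n = sum(self.label_counts.values())
        best, best_score = None, -1.0
        for label, c in self.label_counts.items():
            score = c / n
            for j, v in enumerate(x):
                score *= (self.feature_counts[(label, j, v)] + 1) / (c + 2)
            if score > best_score:
                best, best_score = label, score
        return best

model = TinyNaiveBayes()
for x, label in [([1, 0], "spam"), ([1, 1], "spam"), ([0, 0], "ham")]:
    model.update(x, label)
print(model.predict([1, 0]))
```

A decision tree, by contrast, may need to be recomputed from scratch when new examples arrive, since early splits depend on the whole dataset.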
Notation
- X is the input (matrix of features)
- y are the class labels
- n is the number of examples (generally the index is denoted by a subscript)
- d is the number of features (generally the index is denoted by a superscript)
- ŷ are the predictions
Parametric vs Non-parametric
- Parametric models: have a fixed number of parameters with respect to n, the number of training examples
- Non-parametric models: number of parameters grows with n
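A quick illustrative sketch of the distinction (the size formulas are simplified assumptions): a linear model stores d+1 numbers no matter how large the training set is, while a KNN "model" is the training set itself.

```python
# Compare what each model must store as n (number of examples) grows.
d = 5  # number of features

def linear_model_size(n):
    # Parametric: d weights + 1 bias, fixed w.r.t. n.
    return d + 1

def knn_model_size(n):
    # Non-parametric: KNN keeps every training example (n rows of d features).
    return n * d

for n in (10, 1000):
    print(n, linear_model_size(n), knn_model_size(n))
```

The linear model's storage is identical at n = 10 and n = 1000; KNN's grows 100x.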
General Rules
- We care far more about testing error than training error
- Golden Rule: the test data cannot influence training the model in any way
- Independent and Identically Distributed (IID) assumption
- Fundamental trade-off between getting low training error and having training error approximate test error
- We can mitigate this by penalizing model complexity (e.g. adding a complexity penalty to the training error)
- See also: regularization
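The complexity-penalty idea can be sketched as follows (the candidate models, error numbers, and penalty strength are all made up for illustration): instead of choosing the model with the lowest training error, choose the one minimizing training error plus a penalty proportional to complexity.

```python
# Each candidate "model" is summarized as (training_error, complexity).
# Deeper trees fit the training set better but are penalized for it.
models = {
    "depth-1 tree": (0.30, 1),
    "depth-5 tree": (0.10, 5),
    "depth-20 tree": (0.01, 20),  # near-zero training error, very complex
}

lam = 0.02  # strength of the complexity penalty (a hyperparameter)

def penalized_score(name):
    train_error, complexity = models[name]
    return train_error + lam * complexity

best = min(models, key=penalized_score)
print(best)
```

Picking by training error alone would select the depth-20 tree; the penalized score favors the intermediate model, reflecting the fundamental trade-off above.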
- Optimization bias
- How biased is an “error” that we optimized over many possibilities?
- Is large if you compare lots of different models, small if you only compare a few.
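A toy simulation makes optimization bias concrete (entirely synthetic): every "model" below is random guessing, so its true error is 50%, but the best of many models evaluated on one validation set looks much better than 50%.

```python
import random

random.seed(0)
n = 100  # validation examples
labels = [random.randint(0, 1) for _ in range(n)]

def validation_error():
    # A "model" that guesses labels uniformly at random: true error is 0.5.
    guesses = [random.randint(0, 1) for _ in range(n)]
    return sum(g != y for g, y in zip(guesses, labels)) / n

few = min(validation_error() for _ in range(2))      # compare only 2 models
many = min(validation_error() for _ in range(1000))  # compare 1000 models
print(few, many)  # the 1000-model minimum sits far below the true 50%
```

The minimum over 1000 models is an optimistically biased estimate of error, even though no model actually learned anything: we optimized over many possibilities.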
Decision Theory
Are we equally concerned about each potential outcome? Usually not! Sometimes false negatives or false positives have outsized impact; the cost of mistakes can differ:
- Letting a spam message through (false negative) is not a big deal.
- Filtering a legitimate message (false positive) will make users mad
We can look to decision theory to help us here. Denote C(ŷ, y) as the cost of predicting ŷ instead of the actual label y.
Then, instead of predicting the most probable label, compute the expected cost of each possible action and take the action with the lowest expected cost:
- We assign the probability of each state to be the confidence of the predicted class (e.g. predicting ‘spam’ with confidence 0.6 means we set the probability of the true label being ‘spam’ to 0.6)
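A small worked example of the expected-cost rule on the spam scenario (the cost numbers are made up): choose the action minimizing the sum over true labels y of p(y) · C(action, y), rather than simply predicting the most probable label.

```python
# Classifier's confidence per possible true label.
p = {"spam": 0.6, "not spam": 0.4}

# C[action][true_label]: cost of taking `action` when the truth is `true_label`.
C = {
    "filter": {"spam": 0, "not spam": 100},  # false positive: very costly
    "allow":  {"spam": 1, "not spam": 0},    # false negative: minor annoyance
}

def expected_cost(action):
    return sum(p[y] * C[action][y] for y in p)

best_action = min(C, key=expected_cost)
print(best_action)  # even though "spam" is more probable, we allow the message
```

Here filtering has expected cost 0.4 · 100 = 40 while allowing costs only 0.6 · 1 = 0.6, so the lowest-expected-cost action is to let the message through despite ‘spam’ being the most probable label.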