- Input: features of the examples together with their corresponding labels
- Output: a model that can accurately predict the labels of new examples
Decision trees are generally among the most successful machine learning techniques (with the exception of domains such as games). The table below compares them with naive Bayes:
| | Decision trees | Naive Bayes |
| --- | --- | --- |
| # Features used | Sequences of rules, each based on 1 feature | All features |
| Training | One pass over the data per depth | A single pass, just counting |
| New data | May need to recompute the tree | Just update the counts |
| Accuracy | Good if simple rules based on individual features work | Good if features are almost independent given the label |
| Interpretability | Easy to see how decisions are made | Easy to see how each feature influences the decision |
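The "just counting" training and "just update counts" properties of naive Bayes can be sketched directly in code (a minimal sketch for binary features with Laplace smoothing; the class and method names are illustrative, not from the notes):

```python
from collections import defaultdict

class CountingNaiveBayes:
    """Naive Bayes over discrete features: training is just counting."""

    def __init__(self):
        self.label_counts = defaultdict(int)    # count of each label
        self.feature_counts = defaultdict(int)  # count of (label, feature index, value)
        self.n = 0

    def update(self, x, y):
        """Incorporate one example -- new data just updates the counts."""
        self.n += 1
        self.label_counts[y] += 1
        for j, value in enumerate(x):
            self.feature_counts[(y, j, value)] += 1

    def predict(self, x):
        """Pick the label maximizing p(y) * prod_j p(x^j | y), with Laplace smoothing."""
        best_label, best_score = None, float("-inf")
        for y, count in self.label_counts.items():
            score = count / self.n
            for j, value in enumerate(x):
                score *= (self.feature_counts[(y, j, value)] + 1) / (count + 2)
            if score > best_score:
                best_label, best_score = y, score
        return best_label
```

Because `update` only increments counters, adding new examples is cheap, in contrast to a decision tree, which may need to be rebuilt when the data changes.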
- $X$ is the input (the feature matrix)
- $y$ are the class labels
- $n$ is the number of examples (the example index is generally denoted by a subscript, as in $x_i$)
- $d$ is the number of features (the feature index is generally denoted by a superscript, as in $x_i^j$)
- $\hat{y}$ are the predictions
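This notation maps directly onto array shapes in code (a sketch using NumPy; the sizes and the trivial threshold predictor are made up for illustration):

```python
import numpy as np

n, d = 5, 3                         # n examples, d features
X = np.random.rand(n, d)            # input X: row i is example x_i, column j is feature j
y = np.random.randint(0, 2, n)      # class labels y, one per example
yhat = (X[:, 0] > 0.5).astype(int)  # predictions, here from a trivial threshold rule

# x_i is row i of X; the j-th feature of example i is X[i, j] (x_i^j in the notes)
assert X.shape == (n, d) and y.shape == (n,) and yhat.shape == (n,)
```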
Parametric vs Non-parametric
- Parametric models: have a fixed number of parameters with respect to the number of examples $n$
- Non-parametric models: the number of parameters grows with $n$
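A quick way to see the difference: a non-parametric model like nearest neighbours stores the training set itself, so its "parameters" grow with $n$, while a parametric model keeps a fixed-size summary no matter how large $n$ gets (an illustrative sketch; this bare-bones 1-nearest-neighbour class is not from the notes):

```python
import numpy as np

class OneNN:
    """Non-parametric: the 'parameters' are the stored training examples."""

    def fit(self, X, y):
        self.X, self.y = X, y  # memorize all n examples -- storage grows with n
        return self

    def predict(self, x):
        distances = np.sum((self.X - x) ** 2, axis=1)
        return self.y[np.argmin(distances)]

model = OneNN().fit(np.array([[0.0, 0.0], [1.0, 1.0]]), np.array([0, 1]))
# By contrast, a single threshold rule on one feature is parametric:
# it has one parameter (the threshold) regardless of n.
```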
- We care far more about testing error than training error
- Golden Rule: the test data cannot influence training the model in any way
- Independent and Identically Distributed (IID) assumption
- Fundamental trade-off between getting low training error and having training error approximate test error
- Optimization bias
    - How biased is an "error" that we optimized over many possibilities?
    - It is large if you compare lots of different models, small if you only compare a few.
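Optimization bias can be simulated directly: generate random "models" that are pure guessing, evaluate them all on the same validation set, and note that the minimum error looks much better than the true 50% (a sketch; the seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n_val = 100
y_val = rng.integers(0, 2, n_val)            # validation labels

def best_of(num_models):
    """Lowest validation error among num_models random-guessing models."""
    errors = []
    for _ in range(num_models):
        guesses = rng.integers(0, 2, n_val)  # a 'model' that guesses randomly
        errors.append(np.mean(guesses != y_val))
    return min(errors)

# Every model's true error is 50%, but the minimum over many models is optimistic:
print(best_of(3))     # comparing a few models: usually close to 0.5
print(best_of(1000))  # comparing many models: noticeably below 0.5
```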
Are we equally concerned about each potential outcome? Usually not! Sometimes, false negatives or false positives have outsized impact — the cost of mistakes might be different
- Letting a spam message through (false negative) is not a big deal.
- Filtering a legitimate message (false positive) will make users mad.
We can look to decision theory to help us here. Denote by $C(\hat{y}, y)$ the cost of predicting $\hat{y}$ when the actual label is $y$.
Then, instead of predicting the most probable label, compute the expected cost of every possible action and take the action with the lowest expected cost:

$$\hat{y} = \operatorname*{argmin}_{\hat{y}} \; \mathbb{E}_{y}\big[C(\hat{y}, y)\big] = \operatorname*{argmin}_{\hat{y}} \sum_{y} p(y \mid x)\, C(\hat{y}, y)$$
- We assign the probability of each state (true label) to be the model's confidence in that class (e.g., predicting 'spam' with 0.6 likelihood means we set the probability that the true label is 'spam' to 0.6)
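The minimum-expected-cost rule can be sketched for the spam example: with a predicted 'spam' probability of 0.6 and a cost matrix where false positives are much more expensive than false negatives, the cheapest action differs from the most probable label (the cost values are made up for illustration):

```python
# C[action][true_label]: cost of predicting `action` when the true label is `true_label`
C = {
    "spam":     {"spam": 0.0, "not spam": 10.0},  # false positive: filtering real mail is costly
    "not spam": {"spam": 1.0, "not spam": 0.0},   # false negative: letting spam through is cheap
}

def min_expected_cost_action(p):
    """p maps each true label to its probability; return the action with lowest expected cost."""
    def expected_cost(action):
        return sum(p[label] * C[action][label] for label in p)
    return min(C, key=expected_cost)

p = {"spam": 0.6, "not spam": 0.4}
# Expected cost of predicting 'spam': 0.6*0 + 0.4*10 = 4.0
# Expected cost of predicting 'not spam': 0.6*1 + 0.4*0 = 0.6
print(min_expected_cost_action(p))  # prints "not spam", despite 'spam' being more probable
```

So even though 'spam' is the most probable label, the asymmetric costs make 'not spam' the better action.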