- Input: take features of examples and corresponding labels as inputs
- Output: a model that can accurately predict the labels of new examples

Supervised learning is generally the most successful machine learning technique (with the exception of games).

Examples:

- Decision trees
- Naive Bayes
- KNN (with an appropriately chosen $k$)
- Ensemble Methods
- Linear regression

Tradeoffs:

| | Decision trees | Naive Bayes |
|---|---|---|
| Features used | Sequences of rules, each based on 1 feature | All features |
| Training | $O(dn)$, one pass per depth | $O(n)$, just counting |
| New data | May need to recompute tree | Just update counts |
| Accuracy | Good if simple rules based on individual features work | Good if features are almost independent given the label |
| Interpretability | Easy to see how decisions are made | Easy to see how each feature influences the decision |
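The "just update counts" row is worth making concrete: a Naive Bayes model is nothing but counters, so absorbing a new example is an $O(d)$ increment rather than a retrain. A hedged sketch (binary features and the toy examples are assumptions for illustration):

```python
from collections import defaultdict

# Sketch of why Naive Bayes handles new data cheaply: the "model" is just
# counts, so adding an example only increments a few counters.
# Binary features are assumed for simplicity.

label_counts = defaultdict(int)
feature_counts = defaultdict(int)  # keyed by (label, feature_index, value)

def update(x, y):
    """Absorb one example in O(d): increment the relevant counters."""
    label_counts[y] += 1
    for j, v in enumerate(x):
        feature_counts[(y, j, v)] += 1

update((1, 0, 1), "spam")
update((0, 0, 1), "not spam")
update((1, 1, 1), "spam")

print(label_counts["spam"])            # 2
print(feature_counts[("spam", 0, 1)])  # 2 (both spam examples have x[0] = 1)
```

A decision tree, by contrast, may need its splits recomputed once the counts behind them shift, which is the "may need to recompute tree" entry above.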

## Notation

- $X$ is the input
- $y$ are the class labels
- $n$ is the number of examples (the example index is generally denoted by a subscript)
- $d$ is the number of features (the feature index is generally denoted by a superscript)
- $\hat{y}$ are the predictions

## Parametric vs Non-parametric

- Parametric models: have fixed number of parameters w.r.t. $n$
- Non-parametric models: number of parameters grows with $n$
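The distinction is easy to see by counting what each model stores. A toy sketch (the size formulas assume a plain linear model with a bias term and a KNN model that stores the raw training set):

```python
# Illustration of parametric vs non-parametric model "size".
# A linear model keeps d + 1 numbers no matter how many examples it saw;
# a nearest-neighbour model keeps all n examples.

def linear_model_size(n, d):
    # weights (one per feature) plus a bias term; independent of n
    return d + 1

def knn_model_size(n, d):
    # stores every training example (n rows of d features) plus n labels
    return n * d + n

print(linear_model_size(10, 3), linear_model_size(10_000, 3))  # 4 4
print(knn_model_size(10, 3), knn_model_size(10_000, 3))        # 40 40000
```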

## General Rules

- We care far more about testing error than training error
- Golden Rule: the test data cannot influence training the model in any way
- Independent and Identically Distributed (IID) assumption: training and test examples are drawn independently from the same distribution
- Fundamental trade-off between getting low training error and having the training error approximate the test error
  - We can mitigate this by penalizing model complexity
  - See also: regularization
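One simple way to penalize complexity during model selection is to rank models by training error plus a complexity penalty instead of training error alone. The candidate models, error values, and penalty strength below are made-up numbers for illustration:

```python
# Sketch of penalizing model complexity when selecting a model: pick the
# lowest (training error + penalty) rather than the lowest training error.
# All numbers here are illustrative assumptions.

candidates = {
    # model name: (training error, number of parameters)
    "depth-2 tree": (0.18, 3),
    "depth-5 tree": (0.10, 31),
    "depth-10 tree": (0.02, 1023),  # tiny training error, likely overfit
}

lam = 0.0005  # strength of the complexity penalty (an assumed value)

def penalized(name):
    err, k = candidates[name]
    return err + lam * k

best = min(candidates, key=penalized)
print(best)  # the huge tree's tiny training error no longer wins
```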

- Optimization bias
  - How biased is an "error" that we optimized over many possibilities?
  - It is large if you compare lots of different models, small if you only compare a few.
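Optimization bias is easy to see by simulation: evaluate many purely random classifiers on the same validation set and keep the best score. Every model truly has 50% accuracy, yet the best-of-$k$ score looks better as $k$ grows (the set size and seed below are arbitrary choices):

```python
import random

random.seed(0)

# Each "model" guesses labels at random, so its true accuracy is 50%.
# Taking the best validation score over many such models inflates the
# apparent accuracy: that inflation is optimization bias.

n_val = 100
labels = [random.randint(0, 1) for _ in range(n_val)]

def best_score(num_models):
    best = 0.0
    for _ in range(num_models):
        preds = [random.randint(0, 1) for _ in range(n_val)]
        acc = sum(p == t for p, t in zip(preds, labels)) / n_val
        best = max(best, acc)
    return best

print(best_score(1))     # roughly 0.5, as expected for one random guesser
print(best_score(1000))  # noticeably above 0.5, despite all models guessing
```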

## Decision Theory

Are we equally concerned about each potential outcome? Usually not! The cost of mistakes can differ: false negatives and false positives often have very different impacts.

- Letting a spam message through (false negative) is not a big deal.
- Filtering a legitimate message (false positive) will make users mad.

We can look to decision theory to help us here. Denote $\mathrm{cost}(\hat{y}_{i}, \tilde{y}_{i})$ as the cost of predicting $\hat{y}_{i}$ when the actual label is $\tilde{y}_{i}$.

Then, instead of predicting the most probable label, compute the expected cost of every possible action and take the action with the lowest expected cost: $E[\mathrm{cost}(\hat{y}_{i}, \tilde{y}_{i})]$.

- We assign the probability of each state to be the confidence of the predicted class (e.g. predicting ‘spam’ with 0.6 likelihood means we set the probability of true ‘spam’ to 0.6)
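Putting the spam example together: with the classifier's 0.6 spam confidence as the state probability and an asymmetric cost matrix (the specific cost values below are illustrative assumptions), the minimum-expected-cost action can disagree with the most-probable-label prediction:

```python
# Expected-cost prediction for the spam example. The cost values and the
# 0.6 spam confidence are illustrative assumptions, not from a real model.
# cost[(a, s)] = cost of taking action a when the true state is s.

cost = {
    ("spam", "spam"): 0.0,        # correctly filtered
    ("spam", "not spam"): 100.0,  # false positive: filtered a real message
    ("not spam", "spam"): 1.0,    # false negative: let spam through
    ("not spam", "not spam"): 0.0,
}

def min_expected_cost_action(p_spam):
    # Treat the classifier's confidence as the probability of each true state.
    probs = {"spam": p_spam, "not spam": 1.0 - p_spam}
    expected = {
        action: sum(probs[state] * cost[(action, state)] for state in probs)
        for action in ("spam", "not spam")
    }
    return min(expected, key=expected.get)

# At 60% spam confidence the most probable label is "spam", but the high
# false-positive cost makes "don't filter" the cheaper action in expectation.
print(min_expected_cost_action(0.6))    # -> not spam
print(min_expected_cost_action(0.999))  # -> spam
```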