• Input: take features of examples and corresponding labels as inputs
  • Output: a model that can accurately predict the labels of new examples

Generally, the most successful machine learning technique (with the exception of games)



Decision treesNaive Bayes
# Features usedSequences of rules based on 1 featureAll features
Training, one pass per depth, just counting
New dataMay need to recompute treeJust update counts
AccuracyGood if simple rules based on individual features workGood if features almost independent given label
Interpretabilityeasy to see how decisions are madeeasy to see how each feature influences decision


  • is the input
  • are the class labels
  • is the number of examples (generally idnex is denoted by subscript)
  • is the number of features (generally index is denoted by superscript)
  • are the predictions

Parametric vs Non-parametric

  • Parametric models: have fixed number of parameters w.r.t.
  • Non-parametric models: number of parameters grows with

General Rules

  • We care far more about testing error than training error
  • Golden Rule: the test data cannot influence training the model in any way
  • Independent and Identically Distributed (IID) assumption
  • Fundamental trade-off between getting low training error and having training error approximate test error
  • Optimization bias
    • How biased is an “error” that we optimized over many possibilities?
    • Is large if you compare lots of different models, small if you only compare a few.

Decision Theory

Are we equally concerned about each potential outcome? Usually not! Sometimes, false negatives or false positives have outsized impact — the cost of mistakes might be different

  • Letting a spam message through (false negative) is not a big deal.
  • Filtering a not spam (false positive) message will make users mad

We can look to decision theory to help us here. Denote as the cost of predicting instead of the actual label .

Then, instead of predicting the most probable label, compute all possible actions and take the action with the lowest expected cost:

  • We assign the probability of each state to be the confidence of the predicted class (e.g. predicting ‘spam’ with 0.6 likelihood means we set the probability of true ‘spam’ to 0.6)