Supervised learning

Input: take features of examples and corresponding labels as inputs
Output: a model that can accurately predict the labels of new examples

Generally, the most successful machine learning technique (with the exception of games)

Examples:

Tradeoffs:

	Decision trees	Naive Bayes
# Features used	Sequences of rules based on 1 feature	All features
Training	$O (d n)$ , one pass per depth	$O (n)$ , just counting
New data	May need to recompute tree	Just update counts
Accuracy	Good if simple rules based on individual features work	Good if features almost independent given label
Interpretability	easy to see how decisions are made	easy to see how each feature influences decision

Notation

$X$ is the input
$y$ are the class labels
$n$ is the number of examples (generally idnex is denoted by subscript)
$d$ is the number of features (generally index is denoted by superscript)
$\overset{y}{^}$ are the predictions

Parametric vs Non-parametric

Parametric models: have fixed number of parameters w.r.t. $n$
Non-parametric models: number of parameters grows with $n$

General Rules

We care far more about testing error than training error
Golden Rule: the test data cannot influence training the model in any way
Independent and Identically Distributed (IID) assumption
Fundamental trade-off between getting low training error and having training error approximate test error
- We can mitigate this by penalizing model complexity (e.g. for Penalizing Model Complexity)
- See also: regularization
Optimization bias
- How biased is an “error” that we optimized over many possibilities?
- Is large if you compare lots of different models, small if you only compare a few.

Decision Theory

Are we equally concerned about each potential outcome? Usually not! Sometimes, false negatives or false positives have outsized impact — the cost of mistakes might be different

Letting a spam message through (false negative) is not a big deal.
Filtering a not spam (false positive) message will make users mad

We can look to decision theory to help us here. Denote $cos t (\overset{y}{^}_{i}, \tilde{y}_{i})$ as the cost of predicting $\overset{y}{^}_{i}$ instead of the actual label $\tilde{y}_{i}$ .

Then, instead of predicting the most probable label, compute all possible actions and take the action with the lowest expected cost: $E [cos t (\overset{y}{^}_{i}, \tilde{y}_{i})]$

We assign the probability of each state to be the confidence of the predicted class (e.g. predicting ‘spam’ with 0.6 likelihood means we set the probability of true ‘spam’ to 0.6)

jzhao.xyz

Recent Writing

2024: Centering

Taste is a guide for what is worthwhile

Agentic Computing

Building a BFT JSON CRDT

Recent Notes

TrueTime

Concurrency control