One vs All
Suppose we only know how to do probabilistic binary classification, but we have $k$ classes we want to distinguish between. We can:
- For each class $c$, train a binary classifier $w_c$ to predict whether example $x_i$ is a $c$.
- This creates $k$ binary classifiers.
- At prediction time, apply all $k$ binary classifiers to get a "score" $w_c^T x_i$ for each class $c$.
- Predict the class with the highest score: $\hat{y}_i = \operatorname{argmax}_c\, w_c^T x_i$.
This divides the space into convex regions (much like K-means)
Notation: $W \in \mathbb{R}^{k \times d}$ denotes the stacked classifier, where row $c$ contains $w_c$ and the score of class $c$ on example $x_i$ is $w_c^T x_i$.
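A minimal sketch of the one-vs-all procedure, assuming numpy and scikit-learn's `LogisticRegression` as the binary base classifier (bias term dropped for brevity); the function names are illustrative, not from these notes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_all(X, y, k):
    """Train k independent binary classifiers; row c of W is w_c."""
    n, d = X.shape
    W = np.zeros((k, d))
    for c in range(k):
        # Binary problem: "is this example a c or not?"
        clf = LogisticRegression(fit_intercept=False)
        clf.fit(X, (y == c).astype(int))
        W[c] = clf.coef_.ravel()
    return W

def predict_one_vs_all(W, X):
    """Score every class with w_c^T x_i and predict the argmax."""
    return np.argmax(X @ W.T, axis=1)
```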
Loss
Problem: how can we define a loss that encourages the largest $w_c^T x_i$ to be $w_{y_i}^T x_i$?
Basically, for each example $i$ we want
- $w_{y_i}^T x_i > w_c^T x_i$ for every class $c$ that is not the correct class $y_i$
- We write $w_{y_i}^T x_i \geq w_c^T x_i + 1$ to avoid the strict inequality
- This is the constraint we use!
Sum
Penalizes each $c$ that violates the constraint:
$$\sum_{i=1}^{n} \sum_{c \neq y_i} \max\{0,\; 1 - w_{y_i}^T x_i + w_c^T x_i\}$$
Max
Penalizes only the $c$ that violates the constraint the most:
$$\sum_{i=1}^{n} \max_{c \neq y_i} \max\{0,\; 1 - w_{y_i}^T x_i + w_c^T x_i\}$$
Adding L2-regularization turns both into multi-class SVMs. Both are convex upper bounds on the 0-1 loss.
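A sketch of both penalties on top of the score matrix $S = XW^T$, assuming numpy; `multiclass_svm_losses` and the regularization strength `lam` are illustrative names, not fixed by these notes.

```python
import numpy as np

def multiclass_svm_losses(W, X, y, lam=1.0):
    """Sum and max hinge losses for scores S = X W^T, plus L2-regularization."""
    n = X.shape[0]
    S = X @ W.T                          # S[i, c] = w_c^T x_i
    correct = S[np.arange(n), y]         # w_{y_i}^T x_i
    margins = np.maximum(0, 1 - correct[:, None] + S)  # hinge for every class
    margins[np.arange(n), y] = 0         # the correct class never violates its own constraint
    reg = (lam / 2) * np.sum(W ** 2)     # (lam/2) * ||W||_F^2
    sum_loss = margins.sum() + reg                 # penalize every violating class
    max_loss = margins.max(axis=1).sum() + reg     # penalize only the worst violator per example
    return sum_loss, max_loss
```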
Multi-class Logistic Regression
Similarly, we can smooth out the max with the log-sum-exp trick to get something differentiable and convex:
$$f(W) = \sum_{i=1}^{n} \left[ -w_{y_i}^T x_i + \log\!\left( \sum_{c=1}^{k} \exp(w_c^T x_i) \right) \right] + \frac{\lambda}{2} \sum_{c=1}^{k} \sum_{j=1}^{d} w_{cj}^2$$
We can rewrite the last term as $\frac{\lambda}{2} \lVert W \rVert_F^2$ (the squared Frobenius norm of the matrix $W$).
This is equivalent to the binary logistic loss for $k = 2$.
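A sketch of this objective, assuming numpy and `scipy.special.logsumexp` for a numerically stable log-sum-exp; the function name and `lam` are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def multiclass_logistic_loss(W, X, y, lam=1.0):
    """Multi-class logistic (softmax) loss with L2-regularization."""
    n = X.shape[0]
    S = X @ W.T                                   # S[i, c] = w_c^T x_i
    lse = logsumexp(S, axis=1)                    # log sum_c exp(w_c^T x_i), computed stably
    data_loss = np.sum(-S[np.arange(n), y] + lse) # -w_{y_i}^T x_i + log-sum-exp term
    reg = (lam / 2) * np.sum(W ** 2)              # (lam/2) * ||W||_F^2
    return data_loss + reg
```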
See also: multi-class probabilities