One vs All

Suppose we only know how to do probabilistic binary classification. But we have classes we want to distinguish between. We can

• For each class , train binary classifier to predict whether example is a .
• This creates binary classifiers
• On prediction, apply the binary classifiers to get a “score” for each class .
• Predict the with the highest score (argmax)

This divides the space into convex regions (much like K-means)

Notation: denotes a classifier where

Loss

Problem: how can we define a loss that encourages the largest to be ?

Basically, for each we want

• for all that is not the correct
• We write to avoid the strict inequality
• This is the constraint we use!

Sum

Penalizes each that violates the constraint

Max

Penalizes the that violates the constraint the most

Adding L2-regularization turns both into multi-class SVMs. Both are convex upper bounds on the 0-1 loss.

Multi-class Logistic Regression

Similarly, we can smooth out the max with the log-sum-exp trick to get something differentiable and convex:

We can rewrite the last term as (the Frobenius norm of the matrix ).

This is equivalent to binary logistic loss for