One vs All

Suppose we only know how to do probabilistic binary classification, but we have $k$ classes we want to distinguish between. We can:

  • For each class $c$, train a binary classifier $w_c$ to predict whether example $x_i$ is a $c$.
    • This creates $k$ binary classifiers.
  • On prediction, apply the $k$ binary classifiers to get a “score” $w_c^T x_i$ for each class $c$.
  • Predict the $c$ with the highest score (argmax): $\hat{y}_i = \operatorname{argmax}_c \, w_c^T x_i$.
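The prediction step above can be sketched in a few lines of numpy (the function name and toy data here are illustrative, not from the original notes):

```python
import numpy as np

# One-vs-all prediction sketch: W is k x d (one binary classifier w_c per
# row), X is n x d (one example per row). Names are assumptions.
def predict_one_vs_all(W, X):
    scores = X @ W.T                   # n x k matrix of scores w_c^T x_i
    return np.argmax(scores, axis=1)   # highest-scoring class per example

# Toy example: k = 3 classes in d = 2 dimensions.
W = np.array([[ 1.0,  0.0],    # classifier for class 0
              [ 0.0,  1.0],    # classifier for class 1
              [-1.0, -1.0]])   # classifier for class 2
X = np.array([[ 2.0,  0.1],
              [ 0.1,  2.0],
              [-1.0, -1.0]])
print(predict_one_vs_all(W, X))  # → [0 1 2]
```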

This divides the space into convex regions (much like K-means)

Notation: $W$ denotes a classifier where row $c$ is the binary classifier $w_c^T$, so $W$ is a $k \times d$ matrix:

$$W = \begin{bmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_k^T \end{bmatrix}$$

Problem: how can we define a loss that encourages the largest $w_c^T x_i$ to be $w_{y_i}^T x_i$?

Basically, for each example $i$ we want:

  • $w_{y_i}^T x_i > w_c^T x_i$ for all $c$ that is not the correct label $y_i$.
  • We write $w_{y_i}^T x_i \geq w_c^T x_i + 1$ to avoid the strict inequality.
  • This is the constraint we use!


$$\sum_{i=1}^n \sum_{c \neq y_i} \max\left\{0,\ 1 - w_{y_i}^T x_i + w_c^T x_i\right\}$$

Penalizes each $c$ that violates the constraint.


$$\sum_{i=1}^n \max_{c \neq y_i} \left[ \max\left\{0,\ 1 - w_{y_i}^T x_i + w_c^T x_i\right\} \right]$$

Penalizes the $c$ that violates the constraint the most.

Adding L2-regularization turns both into multi-class SVMs. Both are convex upper bounds on the 0-1 loss.
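The two losses above differ only in how per-class hinge violations are aggregated, which a short numpy sketch makes concrete (function names are invented for illustration):

```python
import numpy as np

# Sketch of the two multi-class hinge losses: W is k x d, X is n x d,
# y holds the correct class index y_i for each example. Names are assumptions.
def hinge_violations(W, X, y):
    scores = X @ W.T                            # n x k scores w_c^T x_i
    correct = scores[np.arange(len(y)), y]      # w_{y_i}^T x_i per example
    viol = np.maximum(0, 1 - correct[:, None] + scores)  # hinge per (i, c)
    viol[np.arange(len(y)), y] = 0              # exclude c = y_i
    return viol

def sum_loss(W, X, y):    # penalizes every class c that violates
    return hinge_violations(W, X, y).sum()

def max_loss(W, X, y):    # penalizes only the worst-violating class c
    return hinge_violations(W, X, y).max(axis=1).sum()
```

On any data, `max_loss` is never larger than `sum_loss`, since it keeps only one violation per example.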

Multi-class Logistic Regression

Similarly, we can smooth out the max with the log-sum-exp trick to get something differentiable and convex:

$$f(W) = \sum_{i=1}^n \left[ -w_{y_i}^T x_i + \log\left( \sum_{c=1}^k \exp\left(w_c^T x_i\right) \right) \right] + \frac{\lambda}{2} \sum_{c=1}^k \|w_c\|^2$$

We can rewrite the last term as $\frac{\lambda}{2}\|W\|_F^2$ (the squared Frobenius norm of the matrix $W$).
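A minimal numpy sketch of this loss, with the log-sum-exp computed stably by subtracting the per-row maximum (names are illustrative):

```python
import numpy as np

# Sketch of the softmax (multi-class logistic) loss with L2-regularization.
# W is k x d, X is n x d, y holds class indices; lam is lambda.
def softmax_loss(W, X, y, lam=0.0):
    scores = X @ W.T                          # n x k, entries w_c^T x_i
    m = scores.max(axis=1, keepdims=True)     # shift for numerical stability
    lse = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))  # log-sum-exp
    data_term = (-scores[np.arange(len(y)), y] + lse).sum()
    return data_term + 0.5 * lam * np.sum(W ** 2)  # (lam/2) * ||W||_F^2
```

Sanity check: with $W = 0$ every class gets the same score, so each example contributes $\log k$ to the loss.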

This is equivalent to binary logistic loss for $k = 2$ (with one of the two classifiers fixed at $0$, since only score differences matter).
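The equivalence can be checked numerically: with two classes and $w_2 = 0$, the softmax loss of a single example reduces to $\log(1 + \exp(-w_1^T x_i))$ for the correct class (a small verification sketch; the numbers are arbitrary):

```python
import numpy as np

# Check (illustrative): with k = 2 and w_2 = 0, the softmax loss of one
# example with label "class 1" equals the binary logistic loss.
w = np.array([0.7, -1.3])   # classifier for class 1; w_2 is fixed at 0
x = np.array([2.0, 0.5])    # one example
s = w @ x                   # score w_1^T x

# Softmax loss for label "class 1": -w_1^T x + log(exp(w_1^T x) + exp(0))
softmax = -s + np.log(np.exp(s) + 1.0)

# Binary logistic loss with label y = +1.
logistic = np.log(1.0 + np.exp(-s))

print(np.isclose(softmax, logistic))  # → True
```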

See also: multi-class probabilities