## One vs All

Suppose we only know how to do probabilistic binary classification, but we have $k$ classes we want to distinguish between. We can:

- For each class $c$, train a binary classifier to predict whether an example belongs to class $c$.
- This yields $k$ binary classifiers

- At prediction time, apply the $k$ binary classifiers to get a “score” for each class $c$.
- Predict the class $c$ with the highest score (argmax)

This divides the space into convex regions (much like K-means)
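The procedure above can be sketched in numpy. The notes don't fix a particular binary learner, so a least-squares fit to $\pm 1$ labels is used here as a hypothetical stand-in:

```python
import numpy as np

def fit_binary_ls(X, t):
    # Stand-in binary learner (assumption): least-squares fit to +/-1 labels.
    return np.linalg.lstsq(X, t, rcond=None)[0]

def train_one_vs_all(X, y, k):
    # One weight vector w_c per class; positives are the examples with y == c.
    return np.stack([fit_binary_ls(X, np.where(y == c, 1.0, -1.0))
                     for c in range(k)])

def predict_one_vs_all(W, X):
    # Score every class with its classifier and predict the argmax.
    return np.argmax(X @ W.T, axis=1)
```

Any binary learner that returns a weight vector could replace `fit_binary_ls`; the one-vs-all recipe only needs per-class scores to argmax over.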

Notation: $w_{y_{i}}$ denotes a classifier where $c=y_{i}$

## Loss

Problem: how can we define a loss that encourages the largest $w_{c}x_{i}$ to be $w_{y_{i}}x_{i}$?

Basically, for each $x_{i}$ we want

- $w_{y_{i}}x_{i}>w_{c}x_{i}$ for all $c \neq y_{i}$
- We write $w_{y_{i}}x_{i}\geq w_{c}x_{i}+1$ to avoid the strict inequality
- This is the constraint we use!

### Sum

Penalizes each $c$ that violates the constraint

$\sum_{c\neq y_{i}}\max\{0,\,1-w_{y_{i}}x_{i}+w_{c}x_{i}\}$
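A direct numpy sketch of this loss, summed over all examples (rows of `X`), with the correct class excluded from each inner sum:

```python
import numpy as np

def sum_hinge_loss(W, X, y):
    # "Sum" multi-class hinge: penalize every wrong class c whose score
    # comes within 1 of the correct class's score.
    n = len(y)
    scores = X @ W.T                        # n x k matrix of w_c x_i
    correct = scores[np.arange(n), y]       # w_{y_i} x_i for each example
    margins = np.maximum(0, 1 - correct[:, None] + scores)
    margins[np.arange(n), y] = 0            # exclude c == y_i from the sum
    return margins.sum()
```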

### Max

Penalizes the $c$ that violates the constraint the most

$\max_{c\neq y_{i}}\{\max\{0,\,1-w_{y_{i}}x_{i}+w_{c}x_{i}\}\}$
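The same computation, but taking the maximum over classes instead of the sum, so each example contributes at most one hinge term:

```python
import numpy as np

def max_hinge_loss(W, X, y):
    # "Max" multi-class hinge: only the single worst-violating class
    # is penalized for each example.
    n = len(y)
    scores = X @ W.T
    correct = scores[np.arange(n), y]
    margins = np.maximum(0, 1 - correct[:, None] + scores)
    margins[np.arange(n), y] = 0            # exclude c == y_i
    return margins.max(axis=1).sum()
```

With all-zero scores and $k=3$, both wrong classes violate the constraint by 1, but only the single largest violation counts (the sum version would charge 2).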

Adding L2-regularization turns both into multi-class SVMs. Both are convex upper bounds on the 0-1 loss.

## Multi-class Logistic Regression

Similarly, we can smooth out the max with the log-sum-exp trick to get something differentiable and convex:

$f(W)=\sum_{i=1}^{n}\left[-w_{y_{i}}x_{i}+\log\left(\sum_{c=1}^{k}\exp(w_{c}x_{i})\right)\right]+\frac{\lambda}{2}\sum_{c=1}^{k}\sum_{j=1}^{d}w_{cj}^{2}$

We can rewrite the last term as $\frac{\lambda}{2}\lVert W\rVert_{F}^{2}$ (the squared Frobenius norm of the matrix $W$).
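A numpy sketch of this objective, using the standard max-subtraction trick to evaluate log-sum-exp stably:

```python
import numpy as np

def softmax_loss(W, X, y, lam):
    # Multi-class logistic loss with (lambda/2) * ||W||_F^2 regularization.
    n = len(y)
    scores = X @ W.T                               # n x k matrix of w_c x_i
    # Stable log-sum-exp: subtract each row's max before exponentiating.
    m = scores.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    data = (lse - scores[np.arange(n), y]).sum()
    return data + 0.5 * lam * np.sum(W ** 2)
```

With $W=0$ every class gets the same score, so each example contributes $\log k$ to the unregularized loss, which gives a quick sanity check.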

This is equivalent to the binary logistic loss for $k=2$.
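The $k=2$ equivalence can be checked numerically: for an example whose true class has weights $w_1$, the softmax loss reduces to $\log(1+\exp(-(w_1-w_2)x))$, the binary logistic loss with weight vector $w = w_1 - w_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=3)

# Softmax (multi-class logistic) loss for an example of the first class:
soft = -w1 @ x + np.log(np.exp(w1 @ x) + np.exp(w2 @ x))
# Binary logistic loss with w = w1 - w2 and label y = +1:
binary = np.log1p(np.exp(-(w1 - w2) @ x))
assert np.isclose(soft, binary)
```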

See also: multi-class probabilities