One vs All
Suppose we only know how to do probabilistic binary classification, but we have $k$ classes we want to distinguish between. We can:
- For each class $c$, train a binary classifier $w_c$ to predict whether example $x_i$ is a $c$.
- This creates $k$ binary classifiers.
- At prediction time, apply all $k$ binary classifiers to get a "score" $w_c^T x_i$ for each class $c$.
- Predict the class with the highest score: $\hat{y}_i = \operatorname{argmax}_c\, w_c^T x_i$.
This divides the space into convex regions (much like K-means)
Notation: $W \in \mathbb{R}^{k \times d}$ denotes the stacked classifier, where row $c$ contains $w_c$ and the score of class $c$ on example $x_i$ is $w_c^T x_i$.
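A minimal sketch of the one-vs-all procedure, assuming numpy and scikit-learn's `LogisticRegression` as the binary base classifier (bias term dropped for brevity); the function names are illustrative, not from these notes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_all(X, y, k):
    """Train k independent binary classifiers; row c of W is w_c."""
    n, d = X.shape
    W = np.zeros((k, d))
    for c in range(k):
        # Binary problem: "is this example a c or not?"
        clf = LogisticRegression(fit_intercept=False)
        clf.fit(X, (y == c).astype(int))
        W[c] = clf.coef_.ravel()
    return W

def predict_one_vs_all(W, X):
    """Score every class with w_c^T x_i and predict the argmax."""
    return np.argmax(X @ W.T, axis=1)
```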
Loss
Problem: how can we define a loss that encourages the largest $w_c^T x_i$ to be $w_{y_i}^T x_i$?
Basically, for each example $i$ we want
- $w_{y_i}^T x_i > w_c^T x_i$ for every class $c$ that is not the correct class $y_i$
- We write $w_{y_i}^T x_i \geq w_c^T x_i + 1$ to avoid the strict inequality
- This is the constraint we use!
Sum
Penalizes each $c$ that violates the constraint:
$$\sum_{i=1}^{n} \sum_{c \neq y_i} \max\{0,\; 1 - w_{y_i}^T x_i + w_c^T x_i\}$$
Max
Penalizes only the $c$ that violates the constraint the most:
$$\sum_{i=1}^{n} \max_{c \neq y_i} \max\{0,\; 1 - w_{y_i}^T x_i + w_c^T x_i\}$$
Adding L2-regularization turns both into multi-class SVMs. Both are convex upper bounds on the 0-1 loss.
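A sketch of both penalties on top of the score matrix $S = XW^T$, assuming numpy; `multiclass_svm_losses` and the regularization strength `lam` are illustrative names, not fixed by these notes.

```python
import numpy as np

def multiclass_svm_losses(W, X, y, lam=1.0):
    """Sum and max hinge losses for scores S = X W^T, plus L2-regularization."""
    n = X.shape[0]
    S = X @ W.T                          # S[i, c] = w_c^T x_i
    correct = S[np.arange(n), y]         # w_{y_i}^T x_i
    margins = np.maximum(0, 1 - correct[:, None] + S)  # hinge for every class
    margins[np.arange(n), y] = 0         # the correct class never violates its own constraint
    reg = (lam / 2) * np.sum(W ** 2)     # (lam/2) * ||W||_F^2
    sum_loss = margins.sum() + reg                 # penalize every violating class
    max_loss = margins.max(axis=1).sum() + reg     # penalize only the worst violator per example
    return sum_loss, max_loss
```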
Multi-class Logistic Regression
Similarly, we can smooth out the max with the log-sum-exp trick to get something differentiable and convex:
$$f(W) = \sum_{i=1}^{n} \left[ -w_{y_i}^T x_i + \log\!\left( \sum_{c=1}^{k} \exp(w_c^T x_i) \right) \right] + \frac{\lambda}{2} \sum_{c=1}^{k} \sum_{j=1}^{d} w_{cj}^2$$
We can rewrite the last term as $\frac{\lambda}{2} \lVert W \rVert_F^2$ (the squared Frobenius norm of the matrix $W$).
This is equivalent to the binary logistic loss for $k = 2$.
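A sketch of this objective, assuming numpy and `scipy.special.logsumexp` for a numerically stable log-sum-exp; the function name and `lam` are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def multiclass_logistic_loss(W, X, y, lam=1.0):
    """Multi-class logistic (softmax) loss with L2-regularization."""
    n = X.shape[0]
    S = X @ W.T                                   # S[i, c] = w_c^T x_i
    lse = logsumexp(S, axis=1)                    # log sum_c exp(w_c^T x_i), computed stably
    data_loss = np.sum(-S[np.arange(n), y] + lse) # -w_{y_i}^T x_i + log-sum-exp term
    reg = (lam / 2) * np.sum(W ** 2)              # (lam/2) * ||W||_F^2
    return data_loss + reg
```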
See also: multi-class probabilities