We want a model of $P(y_{i} = \text{important} \mid x_{i})$ for use in decision theory.

- For predictions, we generally map $w^{T}x_{i}$ to class labels (for binary prediction, we used $\operatorname{sign}(w^{T}x_{i})$)
- For probabilities, we want to map $w^{T}x_{i}$ to the range $[0, 1]$

The most common choice is to use the sigmoid function:

$h(z_{i}) = \dfrac{1}{1 + \exp(-z_{i})}$
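As a quick sketch of how this is used in practice (the weight and feature vectors below are made-up values for illustration):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number z to the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights and an example feature vector
w = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 0.5, 2.0])

# Interpreted as P(y = important | x)
p = sigmoid(w @ x)
```

Note that `sigmoid(0)` is exactly 0.5, so the decision boundary $w^{T}x_{i} = 0$ corresponds to a predicted probability of one half.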

## Multi-class Probabilities

See also: multi-class classification

The softmax function allows us to map $k$ real numbers $z_{c} = w_{c}^{T}x_{i}$ (one score per class $c$) to probabilities.

$P(y \mid z_{1}, z_{2}, \ldots, z_{k}) = \dfrac{\exp(z_{y})}{\sum_{c=1}^{k} \exp(z_{c})}$
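A minimal sketch of the formula above (the score vector is an arbitrary example; subtracting the maximum is a standard numerical-stability trick that leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    # Shift by the max so exp() never overflows; softmax is invariant to this shift
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical per-class scores z_c for k = 3 classes
z = np.array([2.0, 1.0, 0.1])
probs = softmax(z)
# probs is a valid distribution: entries are positive and sum to 1
```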

The ‘harder’ alternative to softmax is the argmax function, which simply finds the maximum value, sets it to 1.0, and assigns 0.0 to all other values.

In contrast, the softmax operation serves as a “softer” version of that. Due to the exponentiation involved in softmax, the largest value is emphasized and pushed towards 1.0, while still maintaining a probability distribution over all input values. This allows for a more nuanced representation that captures not only the most likely option but also the relative likelihood of other options.
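The contrast can be seen directly by comparing the two operations on the same scores (scores and the `hardmax` helper name are illustrative, not from the source):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def hardmax(z):
    # argmax as a one-hot vector: 1.0 at the maximum entry, 0.0 elsewhere
    out = np.zeros_like(z, dtype=float)
    out[np.argmax(z)] = 1.0
    return out

z = np.array([3.0, 1.0, 0.5])
soft = softmax(z)  # largest score dominates, but every entry stays nonzero
hard = hardmax(z)  # all probability mass on the single largest score
```

The soft version preserves information about how close the other scores were, which is exactly the “relative likelihood” the text describes.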