MLE maximizes the likelihood $p(D \mid w)$.

Suppose we have a dataset $D$ with parameters $w$. For example,

  1. We flip a coin three times and record the outcomes (say, two heads and one tail)
  2. The parameter $w$ is the probability that this coin lands heads

The likelihood is a probability mass function $p(D \mid w)$. MLE is choosing a $\hat{w}$ that maximizes the likelihood ($\hat{w} \in \operatorname{argmax}_w p(D \mid w)$).

In the case above, $p(D \mid w)$ is the product of the per-flip probabilities: with two heads and one tail, $p(D \mid w) = w \cdot w \cdot (1 - w)$.
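A minimal sketch of this example in Python, assuming the hypothetical two-heads-one-tail outcome above and a brute-force grid over candidate values of $w$ (the encoding and grid are illustrative only):

```python
import numpy as np

# Hypothetical outcome from the example above: two heads, one tail.
flips = np.array([1, 1, 0])  # 1 = heads, 0 = tails

def likelihood(w, flips):
    """p(D | w) for independent coin flips with P(heads) = w."""
    return np.prod(np.where(flips == 1, w, 1 - w))

# Brute-force grid search over candidate parameter values.
ws = np.linspace(0.01, 0.99, 999)
likes = np.array([likelihood(w, flips) for w in ws])
w_hat = ws[np.argmax(likes)]

print(f"grid-search MLE: {w_hat:.3f}")       # ~0.667
print(f"closed form:     {flips.mean():.3f}")  # fraction of heads = 2/3
```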

Notation

argmin and argmax return the set of parameter values achieving the minimum and maximum values, respectively. For example, $\operatorname{argmin}_w \{(w - 2)^2\} = \{2\}$, whereas the minimum value itself is $\min_w (w - 2)^2 = 0$.

We can also show that maximizing the likelihood is equivalent to minimizing the negative log-likelihood (NLL). That is,

$$\hat{w} \in \operatorname{argmax}_w\, p(D \mid w) \quad \Longleftrightarrow \quad \hat{w} \in \operatorname{argmin}_w\, \{-\log p(D \mid w)\}$$

This is true because the logarithm is strictly monotonic, so taking the logarithm does not change the location of the maximum; changing the sign then flips the max into a min.

This is typically easier to work with because, for independent examples, it turns a product of probabilities into a sum of log-probabilities.
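As a rough illustration of why the sum form is nicer in practice, here is a small sketch with made-up Bernoulli data: multiplying thousands of probabilities underflows in floating point, while the corresponding sum of log-probabilities stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 5000 coin flips with true P(heads) = 0.7.
flips = rng.random(5000) < 0.7
w = 0.7

# Per-flip probabilities under parameter w.
probs = np.where(flips, w, 1 - w)

# Multiplying all of them underflows to 0.0 in double precision...
print(np.prod(probs))          # 0.0 (floating-point underflow)

# ...while the negative log-likelihood (a sum) stays finite.
nll = -np.sum(np.log(probs))
print(nll)                     # a finite number, roughly 3000 here
```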

Generative vs Discriminative

  • Discriminative maximizes the conditional likelihood $p(y \mid X, w)$
    • Least squares, robust linear regression, and logistic regression fall under this category
    • We don't model $X$, so we can use complicated features
  • Generative maximizes the joint likelihood $p(y, X \mid w)$

Relation between loss functions

Least squares (squared L2-loss of residuals)

If we let the likelihood function of the labels be Gaussian:

$$p(y_i \mid x_i, w) \propto \exp\!\left(-\frac{(w^Tx_i - y_i)^2}{2}\right)$$

Then the MLE of $w$ is the minimizer of the squared-error objective

$$f(w) = \frac{1}{2}\|Xw - y\|^2$$
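A short derivation sketch, assuming unit variance, independent examples, and absorbing everything that does not depend on $w$ into a constant:

$$-\log p(y \mid X, w) = -\sum_{i=1}^{n} \log p(y_i \mid x_i, w) = \sum_{i=1}^{n} \frac{(w^Tx_i - y_i)^2}{2} + \text{const} = \frac{1}{2}\|Xw - y\|^2 + \text{const}$$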

Absolute error (L1-loss of residuals)

If we let the likelihood function of the labels be Laplacian:

$$p(y_i \mid x_i, w) \propto \exp\!\left(-|w^Tx_i - y_i|\right)$$

Then the MLE of $w$ is the minimizer of the absolute-error objective

$$f(w) = \|Xw - y\|_1$$
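The same argument, assuming a Laplace scale parameter of 1 and again dropping constants:

$$-\log p(y \mid X, w) = \sum_{i=1}^{n} |w^Tx_i - y_i| + \text{const} = \|Xw - y\|_1 + \text{const}$$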

Logistic loss

Let $h(z) = \frac{1}{1 + \exp(-z)}$ be the sigmoid function. If we let the likelihood function of the labels $y_i \in \{-1, +1\}$ be

$$p(y_i \mid x_i, w) = h(y_i w^Tx_i) = \frac{1}{1 + \exp(-y_i w^Tx_i)}$$

Then the MLE of $w$ minimizes the NLL, which we can show to be equivalent to the logistic loss

$$f(w) = \sum_{i=1}^{n} \log\!\left(1 + \exp(-y_i w^Tx_i)\right)$$

The last step is true because of log rules: $-\log\!\left(\frac{1}{1 + \exp(-y_i w^Tx_i)}\right) = \log\!\left(1 + \exp(-y_i w^Tx_i)\right)$, since $-\log(1/a) = \log a$.
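As a sanity check, a small Python sketch with made-up data and labels in $\{-1, +1\}$ that verifies numerically that the NLL of the sigmoid likelihood matches the logistic loss:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: n examples, d features, labels in {-1, +1}.
n, d = 5, 3
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.standard_normal(d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = y * (X @ w)

# NLL of the sigmoid likelihood p(y_i | x_i, w) = h(y_i w^T x_i) ...
nll = -np.sum(np.log(sigmoid(z)))

# ... equals the logistic loss sum_i log(1 + exp(-y_i w^T x_i)).
logistic_loss = np.sum(np.log1p(np.exp(-z)))

print(np.isclose(nll, logistic_loss))  # True
```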

Overfitting

Conceptually, MLE is saying that we should find the $w$ that makes $D$ have the highest probability given $w$, i.e. maximize $p(D \mid w)$. From the No Free Lunch theorem, we know that there is always a model that performs well for some unlikely $D$. This is overfitting!

We actually want to find the $w$ that has the highest probability given the data $D$, i.e. maximize $p(w \mid D)$. For this, we need MAP estimation.