Linear Regression
Vector dimensions:
- $w$ is $(d, 1)$ (weights)
- $y$ is $(n,1)$ (targets)
- $x_i$ is $(d, 1)$ (features)
- $X$ is $(n, d)$ (data matrix); each row is $x_i^T$
Linear regression makes predictions $\hat y_i$ using a linear function of $x_i$: $\hat y_i = w^Tx_i$
We set $w$ to minimize the sum of squared errors: $f(w) = \sum_{i=1}^n (w^Tx_i - y_i)^2$
- In 1D, take the derivative of $f$ and set it equal to 0: $f'(w) = 0$ gives us $w = \frac{\sum_{i=1}^n x_iy_i}{\sum_{i=1}^n x_i^2}$ (see the code sketch after this list).
- Check the second derivative to make sure we have a minimizer (it should be positive): $f''(w) = \sum_{i=1}^n x_i^2$. Since each $x_i^2 \geq 0$, this sum is non-negative (and positive unless every $x_i = 0$), so we have a minimizer.
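As a quick illustration, here is a minimal NumPy sketch of the 1D closed-form solution (the toy data arrays `x` and `y` are made up for the example):

```python
import numpy as np

# Toy 1D data (made up for illustration): y is roughly 2*x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)

# Closed-form 1D least-squares solution: w = sum(x_i * y_i) / sum(x_i^2).
w = np.sum(x * y) / np.sum(x ** 2)
print(w)  # should be close to 2.0
```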
In d-dimensions, we minimize
$$\begin{equation} \begin{split} f(w) &= \frac 1 2 \sum_{i=1}^n (w^Tx_i - y_i)^2 \\ & = \frac 1 2 \lVert Xw - y \rVert^2 \\ & = \frac 1 2 w^TX^TXw - w^TX^Ty + \frac 1 2 y^T y \\ & = \frac 1 2 w^TAw - w^Tb + c \end{split} \end{equation}$$
where $A = X^TX$ is a $(d, d)$ matrix, $b = X^Ty$ is a $(d, 1)$ vector, and $c = \frac 1 2 y^Ty$ is a scalar
The generalized version of “set the derivative to 0 and solve” in d-dimensions is to find where the gradient is zero (see calculus). We get
$$ \begin{equation} \begin{split} \nabla f(w) &= \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \\ \vdots \\ \frac{\partial f}{\partial w_d} \end{bmatrix} \\ &= \begin{bmatrix} \sum_{i=1}^n (w^Tx_i - y_i)x_{i,1} \\ \sum_{i=1}^n (w^Tx_i - y_i)x_{i,2} \\ \vdots \\ \sum_{i=1}^n (w^Tx_i - y_i)x_{i,d} \end{bmatrix} \\ &= Aw - b \\ &= X^TXw - X^Ty \end{split} \end{equation} $$
Setting $\nabla f(w) = 0$ gives the normal equations $X^TXw = X^Ty$, i.e. a linear system $Aw = b$ that we solve for $w$.
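In code, this amounts to forming $A = X^TX$ and $b = X^Ty$ and solving the $(d, d)$ linear system. A minimal sketch (the data is synthetic and the helper name `fit_least_squares` is my own):

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y for w."""
    A = X.T @ X          # (d, d) matrix
    b = X.T @ y          # (d,) vector
    return np.linalg.solve(A, b)

# Synthetic example: n = 200 samples, d = 3 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

w = fit_least_squares(X, y)
print(w)  # close to [1.0, -2.0, 0.5]
```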
We can fit polynomials using a change of basis, replacing $X$ with a basis matrix $Z_p$ whose columns are powers of $x_i$.
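For example, for 1D inputs we can build $Z_p$ with columns $1, x, x^2, \dots, x^p$ and solve the same normal equations in the new basis; a minimal sketch (helper names and data are made up):

```python
import numpy as np

def poly_basis(x, p):
    """Change of basis: map 1D inputs x to columns [1, x, x^2, ..., x^p]."""
    return np.vander(x, p + 1, increasing=True)

# Toy 1D data (made up): a cubic trend plus noise.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.05 * rng.normal(size=100)

p = 3
Z = poly_basis(x, p)                  # (n, p+1) basis matrix
v = np.linalg.solve(Z.T @ Z, Z.T @ y) # same normal equations, in the new basis
y_hat = Z @ v                         # fitted values
```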
# Cost
The cost of solving a linear system of the form $Aw = b$:
- $O(nd)$ to form vector $b$
- $O(nd^2)$ to form matrix A
- Solving a $(d,d)$ system of equations is $O(d^3)$
Overall cost is $O(nd^2+d^3)$
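For a rough sense of scale (numbers made up for illustration): with $n = 10^6$ examples and $d = 10^2$ features,
$$nd^2 = 10^6 \cdot 10^4 = 10^{10} \quad \textrm{vs.} \quad d^3 = 10^6,$$
so forming $A = X^TX$ dominates; when $d$ is large relative to $n$, the $d^3$ solve can dominate instead.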
# Robust Regression
We minimize the L1-norm of the residuals instead of the squared L2-norm; this makes the fit less sensitive to outliers:
$$f(w) = \lVert Xw - y \rVert_1$$
However, since the L1-norm uses the absolute value function, it is non-differentiable at 0. We can instead minimize a smooth approximation of the L1-norm, such as the Huber loss applied to each residual $r_i = w^Tx_i - y_i$:
$$ h(r_i) = \begin{cases} \frac 1 2 r_i^2 & |r_i| \leq \epsilon \\ \epsilon (|r_i| - \frac 1 2 \epsilon) & \textrm{otherwise} \end{cases} $$
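A minimal sketch of the Huber loss and its derivative (the threshold name `eps` and the function names are mine); the derivative is what a gradient-based solver would use, since robust regression has no closed-form solution:

```python
import numpy as np

def huber(r, eps=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    small = np.abs(r) <= eps
    return np.where(small, 0.5 * r**2, eps * (np.abs(r) - 0.5 * eps))

def huber_grad(r, eps=1.0):
    """Derivative of the Huber loss with respect to the residual r."""
    return np.where(np.abs(r) <= eps, r, eps * np.sign(r))

# Residuals r_i = w^T x_i - y_i; large residuals get a linear (not quadratic) penalty.
r = np.array([-3.0, -0.2, 0.1, 5.0])
print(huber(r), huber_grad(r))
```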
Absolute error is more robust to outliers than squared error, and non-convex error functions are the most robust:
- They are generally not influenced even by groups of outliers
- But they are non-convex, so finding the global minimum is hard
# Brittle Regression
Sometimes we instead want to minimize the size of the worst error across examples, e.g. when the worst case is catastrophic (a plane crashes) or when we cannot afford to perform badly on any one group.
We can do this by minimizing the $L_\infty$-norm of the residuals, which is convex but non-smooth; this minimizes the largest error (effectively minimax regret in DUI).
The smooth approximation to the max function is the log-sum-exp function:
$$\max_i \{ z_i \} \approx \log\left( \sum_i \exp(z_i) \right)$$
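In practice log-sum-exp is computed after subtracting the maximum to avoid overflow; a minimal sketch:

```python
import numpy as np

def log_sum_exp(z):
    """Smooth approximation of max(z): log(sum(exp(z))), computed stably."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([1.0, 3.0, 2.5])
print(np.max(z), log_sum_exp(z))  # log-sum-exp is slightly above the true max
```

For the brittle-regression objective we would apply this to both $r_i$ and $-r_i$, since $|r_i| = \max\{r_i, -r_i\}$.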
# Penalizing Model Complexity
To choose the polynomial degree, minimize $\textrm{score}(p) = \frac 1 2 \lVert Z_p v - y \rVert^2 + p$, where $Z_p$ is the degree-$p$ basis matrix, $v$ are the fitted weights, and the $+p$ term penalizes higher degrees.
Other criteria replace the $p$ term with $\lambda k$, where $k$ is the estimated degrees of freedom (for degree-$p$ polynomials, $k = p + 1$) and $\lambda$ controls how strongly we penalize complexity.
$\lambda = 1$ is called the Akaike information criterion (AIC)
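A sketch of selecting the degree with such a score, using $k = p + 1$ and a tunable $\lambda$ (the helper name `poly_score` and the toy data are mine):

```python
import numpy as np

def poly_score(x, y, p, lam=1.0):
    """Training error plus complexity penalty lambda * k, with k = p + 1."""
    Z = np.vander(x, p + 1, increasing=True)   # degree-p polynomial basis
    v = np.linalg.solve(Z.T @ Z, Z.T @ y)      # least-squares fit in that basis
    err = 0.5 * np.sum((Z @ v - y) ** 2)
    return err + lam * (p + 1)

# Toy data (made up): quadratic trend plus noise.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=50)
y = 1.0 + x - 2.0 * x**2 + 0.1 * rng.normal(size=50)

scores = {p: poly_score(x, y, p) for p in range(0, 8)}
best_p = min(scores, key=scores.get)
print(best_p)  # should prefer a low degree (the true trend here is quadratic)
```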
See also: regularization