Vector dimensions:

- $w$ is $(d,1)$ (weights)
- $y$ is $(n,1)$ (targets)
- $x_{i}$ is $(d,1)$ (features)
- $X$ is $(n,d)$ each row is $x_{i}$

Linear regression makes predictions $y^ _{i}$ using a linear function of $x_{i}$: $y^ _{i}=w_{T}x_{i}$

We set $w$ to minimize the sum of squared errors: $f(w)=∑_{i=1}(w_{T}x_{i}−y_{i})_{2}$

- Take the derivative of $f$ and set it equal to 0 $f_{′}(w)=0$ gives us $w=∑_{i=1}x_{i}∑_{i=1}x_{i}y_{i} $
- Check to second derivative to make sure we have a minimizer (if double derivative is positive). $f_{′′}(w)=∑_{i=1}x_{i}$. As $x_{i}$ by definition must always be positive, this is a minimizer.

In d-dimensions, we minimize

where $A$ is a matrix, $b$ is a vector, and $c$ is a scalar

The generalized version of “set the derivative to 0 and solve” in d-dimensions is to find where the gradient is zero (see calculus). We get

We can fit to polynomial equations using a change of basis

## Cost

Of solving equations in the form $Aw=b$

- $O(nd)$ to form vector $b$
- $O(nd_{2})$ to form matrix A
- Solving a $(d,d)$ system of equations is $O(d_{3})$

Overall cost is $O(nd_{2}+d_{3})$

## Robust Regression

We minimize the L1-norm of residuals instead of L2-norm

$f(w)=∥Xw−y∥_{1}$

However, as the L1-norm uses the absolute function, it is non-differentiable at 0. We can use a smooth approximation of the L1-norm instead, like Huber loss:

Absolute error is more robust and non-convex errors are the most robust.

- Generally not influenced by outlier groups
- But it is non-convex so finding global minimum is hard

## Brittle Regression

You want to minimize size of worst error across examples. For example, if in worst case the plane can crash or you perform badly on a group.

We can instead minimize the $L_{∞}$ norm which is convex but non-smooth. This effectively minimizes the highest error (effectively Minimax regret in DUI).

The smooth approximation to the max function is the log-sum-exp function:

$max_{i}{z_{i}}≈g(∑_{i}exp(z_{i}))$

## Penalizing Model Complexity

Optimize $score(p)=21 ∥Z_{p}v−y∥_{2}+p$ where $p$ is the degree of the polynomial.

Other ones also exist which replace the $p$ term with $λk$ where $k$ is the estimated degrees of freedom (for polynomials, $k=p+1$). $λ$ controls how strongly we penalize complexity.

$λ=1$ is called the Akaike information criterion (AIC)

See also: regularization