Vector dimensions:

• $w$ is $d \times 1$ (weights)
• $y$ is $n \times 1$ (targets)
• $X$ is $n \times d$ (features)
• each row $x_i^T$ of $X$ is one example
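A minimal NumPy sketch of these shapes (the sizes `n = 100`, `d = 3` are made up for illustration):

```python
import numpy as np

n, d = 100, 3  # hypothetical sizes: 100 examples, 3 features
rng = np.random.default_rng(0)

X = rng.standard_normal((n, d))  # feature matrix, one example per row
w = rng.standard_normal(d)       # weight vector
y = X @ w                        # targets (here: noiseless linear targets)

print(X.shape, w.shape, y.shape)  # (100, 3) (3,) (100,)
```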

Linear regression makes predictions using a linear function of $x_i$:

$$\hat{y}_i = w^T x_i$$

We set $w$ to minimize the sum of squared errors:

$$f(w) = \frac{1}{2}\sum_{i=1}^{n}(w^T x_i - y_i)^2$$

1. Taking the derivative of $f$ and setting it equal to 0 (in the 1D case, where each $x_i$ is a scalar) gives $w = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$.
2. Check the second derivative to make sure we have a minimizer (it must be positive). Here $f''(w) = \sum_{i=1}^n x_i^2$, and since a sum of squares is by definition non-negative (and positive unless all $x_i = 0$), this is a minimizer.
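A quick sketch of the 1D closed-form solution above (the data here is synthetic, generated as roughly $y = 2x$ plus noise):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(50)
y = 2.0 * x + 0.1 * rng.standard_normal(50)  # roughly y = 2x plus noise

# Closed-form 1D least-squares solution: w = sum(x_i * y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x * x)
print(w)  # close to 2.0
```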

In $d$ dimensions, we minimize

$$f(w) = \frac{1}{2}\|Xw - y\|^2 = \frac{1}{2}w^T X^T X w - w^T X^T y + \frac{1}{2}y^T y$$

where $X^T X$ is a $d \times d$ matrix, $X^T y$ is a $d \times 1$ vector, and $y^T y$ is a scalar.

The generalized version of “set the derivative to 0 and solve” in $d$ dimensions is to find where the gradient is zero (see calculus). Setting $\nabla f(w) = X^T X w - X^T y = 0$ gives the normal equations:

$$X^T X w = X^T y$$
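A sketch of solving the normal equations in NumPy (random data; `np.linalg.lstsq` solves the same least-squares problem more stably and is usually preferred in practice):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Solve the normal equations X^T X w = X^T y directly
w = np.linalg.solve(X.T @ X, X.T @ y)

# lstsq minimizes ||Xw - y||^2 without explicitly forming X^T X
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_lstsq))  # True
```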

We can fit polynomials using a change of basis: replace each feature $x_i$ with $z_i = [1, x_i, x_i^2, \ldots, x_i^p]$ and run least squares on the resulting matrix $Z$.
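The change of basis can be sketched like this (fitting a noiseless quadratic, so the recovered coefficients should match exactly):

```python
import numpy as np

def poly_basis(x, p):
    """Map a 1D feature x to the polynomial basis [1, x, x^2, ..., x^p]."""
    return np.vander(x, p + 1, increasing=True)

rng = np.random.default_rng(3)
x = rng.standard_normal(100)
y = 1.0 + 2.0 * x - 3.0 * x**2  # quadratic targets, no noise

Z = poly_basis(x, 2)                   # change of basis
w = np.linalg.solve(Z.T @ Z, Z.T @ y)  # ordinary least squares on Z
print(np.round(w, 3))  # close to [1, 2, -3]
```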

## Cost

Of solving the normal equations $X^T X w = X^T y$:

1. $O(nd)$ to form the vector $X^T y$
2. $O(nd^2)$ to form the matrix $X^T X$
3. Solving the resulting $d \times d$ system of equations is $O(d^3)$

Overall cost is $O(nd^2 + d^3)$.

## Robust Regression

We minimize the L1-norm of the residuals instead of the L2-norm:

$$f(w) = \|Xw - y\|_1 = \sum_{i=1}^n |w^T x_i - y_i|$$

However, as the L1-norm uses the absolute value, it is non-differentiable at 0. We can use a smooth approximation of the L1-norm instead, like the Huber loss:

$$f(w) = \sum_{i=1}^n h(w^T x_i - y_i), \qquad h(r) = \begin{cases} \frac{1}{2}r^2 & \text{if } |r| \le \epsilon \\ \epsilon\left(|r| - \frac{1}{2}\epsilon\right) & \text{otherwise} \end{cases}$$
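A minimal implementation of the Huber loss above (with the threshold $\epsilon$ as a parameter):

```python
import numpy as np

def huber(r, eps=1.0):
    """Huber loss: quadratic near zero, linear (like |r|) for large residuals."""
    small = np.abs(r) <= eps
    return np.where(small, 0.5 * r**2, eps * (np.abs(r) - 0.5 * eps))

r = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(huber(r))  # [2.5, 0.125, 0., 0.125, 2.5]
```

Note that the two pieces agree in value and slope at $|r| = \epsilon$, which is what makes the approximation smooth (differentiable everywhere).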

Absolute error is more robust than squared error, and non-convex errors (which cap the loss for large residuals) are the most robust:

• Generally not influenced even by groups of outliers
• But non-convex, so finding the global minimum is hard

## Brittle Regression

You want to minimize the size of the worst error across examples. This matters when the worst case is what counts: for example, if in the worst case the plane can crash, or the model performs badly on a particular group.

We can instead minimize the $\infty$-norm, which is convex but non-smooth:

$$f(w) = \|Xw - y\|_\infty = \max_i |w^T x_i - y_i|$$

This effectively minimizes the highest error (effectively minimax regret in DUI).

The smooth approximation to the max function is the log-sum-exp function:

$$\max_i z_i \approx \log\left(\sum_i \exp(z_i)\right)$$
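A sketch of log-sum-exp, using the standard max-subtraction trick so `exp` does not overflow for large inputs:

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log(sum(exp(z))): a smooth upper bound on max(z)."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([1.0, 2.0, 10.0])
print(np.max(z), logsumexp(z))  # logsumexp is slightly above the max
```

The approximation is tight when one element dominates: here the extra terms contribute only $\log(1 + e^{-8} + e^{-9}) \approx 0.0005$.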

## Penalizing Model Complexity

Optimize $\text{score}(p) = \frac{1}{2}\|Z_p w - y\|^2 + p$, where $p$ is the degree of the polynomial.

Other criteria also exist which replace the $p$ term with $\lambda k$, where $k$ is the estimated degrees of freedom (for polynomials, $k = p + 1$). $\lambda$ controls how strongly we penalize complexity.

$\lambda = 1$ gives the Akaike information criterion (AIC).
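A sketch of choosing the polynomial degree with an AIC-style score ($\lambda = 1$, $k = p + 1$). The data here is synthetic: a cubic plus a little noise, so the score should bottom out at $p = 3$:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 60)
y = 1.0 - 2.0 * x + 3.0 * x**3 + 0.05 * rng.standard_normal(60)  # cubic + noise

def score(p, lam=1.0):
    """AIC-style score: half squared training error plus lam * (p + 1)."""
    Z = np.vander(x, p + 1, increasing=True)   # polynomial basis of degree p
    w = np.linalg.solve(Z.T @ Z, Z.T @ y)      # least-squares fit in that basis
    err = 0.5 * np.sum((Z @ w - y) ** 2)
    return err + lam * (p + 1)

best_p = min(range(8), key=score)
print(best_p)
```

Lower degrees pay a large error term (the missing cubic term cannot be fit), while higher degrees reduce the training error only slightly but pay one extra unit of penalty per degree.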