Vector dimensions:
- $w$ is $d \times 1$ (weights)
- $y$ is $n \times 1$ (targets)
- $X$ is $n \times d$ (features)
- each row of $X$ is $x_i^T$, the feature vector of example $i$
Linear regression makes predictions $\hat{y}_i$ using a linear function of $x_i$:
$$\hat{y}_i = w^T x_i$$
We set $w$ to minimize the sum of squared errors:
$$f(w) = \frac{1}{2} \sum_{i=1}^n (w^T x_i - y_i)^2$$
- In one dimension ($d = 1$), taking the derivative of $f(w) = \frac{1}{2}\sum_{i=1}^n (w x_i - y_i)^2$ and setting it equal to 0 gives us $w = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$
- Check the second derivative to make sure we have a minimizer (a stationary point is a minimizer if the second derivative is positive). Here $f''(w) = \sum_{i=1}^n x_i^2$, which by definition is always positive (as long as some $x_i \neq 0$), so this is a minimizer.
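A minimal sketch of this 1-D closed form in NumPy (the data here is made up for illustration):

```python
import numpy as np

# Toy 1-D data (illustrative only): y is roughly 2*x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + 0.1 * rng.normal(size=100)

# Closed-form 1-D least squares: w = (sum_i x_i * y_i) / (sum_i x_i^2).
w = np.sum(x * y) / np.sum(x ** 2)
print(w)  # should be close to 2
```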
In $d$ dimensions, we minimize
$$f(w) = \frac{1}{2}\|Xw - y\|^2 = \frac{1}{2} w^T X^T X w - w^T X^T y + \frac{1}{2} y^T y$$
where $X^T X$ is a $d \times d$ matrix, $X^T y$ is a $d \times 1$ vector, and $y^T y$ is a scalar.
The generalized version of "set the derivative to 0 and solve" in $d$ dimensions is to find where the gradient is zero (see matrix calculus). Setting $\nabla f(w) = X^T X w - X^T y = 0$ gives the normal equations
$$X^T X w = X^T y$$
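As a sketch, the normal equations can be solved directly in NumPy (fit_least_squares is my name for the helper; the data is synthetic):

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y for w.

    Uses a linear solver rather than an explicit inverse, which is
    cheaper and more numerically stable.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative usage with random data: n = 500 examples, d = 3 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=500)
print(fit_least_squares(X, y))  # should be close to w_true
```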
We can fit polynomials using a change of basis: replace each $x_i$ with $z_i = \begin{bmatrix} 1 & x_i & x_i^2 & \cdots & x_i^p \end{bmatrix}$ and run least squares on $Z$ instead of $X$.
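One possible sketch of the polynomial change of basis (poly_basis is a hypothetical helper; the degree and data are arbitrary):

```python
import numpy as np

def poly_basis(x, p):
    """Map a 1-D feature vector x to the basis [1, x, x^2, ..., x^p]."""
    return np.vander(x, N=p + 1, increasing=True)

# Fit a degree-3 polynomial by running ordinary least squares on Z.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=200)
y = 1 + x - 2 * x ** 3 + 0.05 * rng.normal(size=200)
Z = poly_basis(x, 3)
v = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(v)  # should be close to [1, 1, 0, -2]
```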
Cost
Of solving equations in the form $X^T X w = X^T y$:
- $O(nd)$ to form the vector $X^T y$
- $O(nd^2)$ to form the matrix $X^T X$
- Solving a $d \times d$ system of equations is $O(d^3)$
Overall cost is $O(nd^2 + d^3)$.
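As a rough, made-up illustration of which term dominates: with $n = 10^6$ examples and $d = 10^3$ features,
$$nd^2 = 10^{12} \quad \text{vs.} \quad d^3 = 10^9,$$
so forming $X^T X$ is the bottleneck whenever $n \gg d$.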
Robust Regression
We minimize the L1-norm of the residuals instead of the L2-norm:
$$f(w) = \|Xw - y\|_1 = \sum_{i=1}^n |w^T x_i - y_i|$$
However, as the L1-norm uses the absolute value function, it is non-differentiable at 0. We can use a smooth approximation of the L1-norm instead, like the Huber loss:
$$f(w) = \sum_{i=1}^n h(w^T x_i - y_i), \qquad h(r) = \begin{cases} \frac{1}{2} r^2 & \text{if } |r| \le \epsilon \\ \epsilon\left(|r| - \frac{1}{2}\epsilon\right) & \text{otherwise} \end{cases}$$
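A minimal sketch of the Huber objective above, minimized with a generic gradient-based optimizer (the function name, $\epsilon = 1$, and the synthetic outlier data are my choices, not fixed by the notes):

```python
import numpy as np
from scipy.optimize import minimize

def huber_loss(w, X, y, eps=1.0):
    """Sum of Huber losses of the residuals r_i = w^T x_i - y_i.

    Quadratic for |r| <= eps, linear beyond eps: smooth everywhere,
    but grows slowly for large (outlier) residuals.
    """
    r = X @ w - y
    quad = 0.5 * r ** 2
    lin = eps * (np.abs(r) - 0.5 * eps)
    return np.sum(np.where(np.abs(r) <= eps, quad, lin))

# Illustrative data with a few gross outliers in y.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=100)
y[:5] += 50  # outliers
w_hat = minimize(huber_loss, np.zeros(2), args=(X, y)).x
print(w_hat)  # should stay close to [1, -1] despite the outliers
```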
Absolute error is more robust than squared error, and non-convex errors are the most robust:
- Generally not influenced even by groups of outliers
- But non-convex, so finding the global minimum is hard
Brittle Regression
Sometimes you instead want to minimize the size of the worst error across examples: for example, if in the worst case a plane crashes or you perform badly on some group.
We can instead minimize the $\infty$-norm, which is convex but non-smooth:
$$f(w) = \|Xw - y\|_\infty = \max_i |w^T x_i - y_i|$$
This effectively minimizes the highest error (similar to minimax regret in decision making under uncertainty).
The smooth approximation to the max function is the log-sum-exp function:
$$\max_i z_i \approx \log\left(\sum_i \exp(z_i)\right)$$
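A small sketch of log-sum-exp as a smooth max (the max-subtraction trick is a standard way to avoid overflow; the example values are arbitrary):

```python
import numpy as np

def logsumexp(z):
    """Smooth approximation to max(z): log(sum_i exp(z_i)).

    Subtracting max(z) before exponentiating avoids overflow and does
    not change the result.
    """
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([1.0, 3.0, 2.5])
print(np.max(z), logsumexp(z))  # 3.0 vs. roughly 3.56
```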
Penalizing Model Complexity
Optimize $\text{score}(p) = \frac{1}{2}\|Z_p v - y\|^2 + p$, where $p$ is the degree of the polynomial.
Other criteria also exist which replace the $p$ term with $\lambda k$, where $k$ is the estimated degrees of freedom (for polynomials, $k = p + 1$). $\lambda$ controls how strongly we penalize complexity.
The choice $\lambda = 1$ is called the Akaike information criterion (AIC).
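A sketch of selecting the degree with this score, assuming the $\text{score}(p)$ form above with $\lambda = 1$ (AIC-style); the data, degree range, and function name are illustrative:

```python
import numpy as np

def score(p, x, y, lam=1.0):
    """Training error plus lam * (degrees of freedom) for a degree-p polynomial fit."""
    Z = np.vander(x, N=p + 1, increasing=True)
    v = np.linalg.solve(Z.T @ Z, Z.T @ y)
    return 0.5 * np.sum((Z @ v - y) ** 2) + lam * (p + 1)

# Data generated from a cubic; the penalized score should pick p near 3.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=200)
y = x - 2 * x ** 3 + 0.1 * rng.normal(size=200)
best_p = min(range(1, 11), key=lambda p: score(p, x, y))
print(best_p)
```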
See also: regularization