Linear Regression
Vector dimensions:
- $w$ is $(d, 1)$ (weights)
- $y$ is $(n,1)$ (targets)
- $x_i$ is $(d, 1)$ (features)
- $X$ is $(n, d)$ (data matrix); each row is $x_i^T$
Linear regression makes predictions $\hat y_i$ using a linear function of $x_i$: $\hat y_i = w^Tx_i$
We set $w$ to minimize the sum of squared errors: $f(w) = \sum_{i=1}^n (w^Tx_i - y_i)^2$
- In 1D, take the derivative of $f$ and set it equal to 0: $f'(w) = 0$ gives us $w = \frac{\sum_{i=1}^n x_iy_i}{\sum_{i=1}^n x_i^2}$ (see the code sketch after this list).
- Check the second derivative to make sure we have a minimizer (it should be positive): $f''(w) = \sum_{i=1}^n x_i^2$. Since each $x_i^2 \geq 0$, this sum is non-negative (and positive unless every $x_i = 0$), so we have a minimizer.
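As a quick illustration, here is a minimal NumPy sketch of the 1D closed-form solution (the toy data arrays `x` and `y` are made up for the example):

```python
import numpy as np

# Toy 1D data (made up for illustration): y is roughly 2*x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)

# Closed-form 1D least-squares solution: w = sum(x_i * y_i) / sum(x_i^2).
w = np.sum(x * y) / np.sum(x ** 2)
print(w)  # should be close to 2.0
```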
In d-dimensions, we minimize
$$\begin{equation} \begin{split} f(w) &= \frac 1 2 \sum_{i=1}^n (w^Tx_i - y_i)^2 \\ & = \frac 1 2 \lVert Xw - y \rVert^2 \\ & = \frac 1 2 w^TX^TXw - w^TX^Ty + \frac 1 2 y^T y \\ & = \frac 1 2 w^TAw - w^Tb + c \end{split} \end{equation}$$
where $A = X^TX$ is a $(d, d)$ matrix, $b = X^Ty$ is a $(d, 1)$ vector, and $c = \frac 1 2 y^Ty$ is a scalar
The generalized version of “set the derivative to 0 and solve” in d-dimensions is to find where the gradient is zero (see calculus). We get
$$ \begin{equation} \begin{split} \nabla f(w) &= \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \\ \vdots \\ \frac{\partial f}{\partial w_d} \end{bmatrix} \\ &= \begin{bmatrix} \sum_{i=1}^n (w^Tx_i - y_i)x_{i,1} \\ \sum_{i=1}^n (w^Tx_i - y_i)x_{i,2} \\ \vdots \\ \sum_{i=1}^n (w^Tx_i - y_i)x_{i,d} \end{bmatrix} \\ &= Aw - b \\ &= X^TXw - X^Ty \end{split} \end{equation} $$
Setting $\nabla f(w) = 0$ gives the normal equations $X^TXw = X^Ty$, i.e. a linear system $Aw = b$ that we solve for $w$.
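In code, this amounts to forming $A = X^TX$ and $b = X^Ty$ and solving the $(d, d)$ linear system. A minimal sketch (the data is synthetic and the helper name `fit_least_squares` is my own):

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y for w."""
    A = X.T @ X          # (d, d) matrix
    b = X.T @ y          # (d,) vector
    return np.linalg.solve(A, b)

# Synthetic example: n = 200 samples, d = 3 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

w = fit_least_squares(X, y)
print(w)  # close to [1.0, -2.0, 0.5]
```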
We can fit polynomials using a change of basis, replacing $X$ with a basis matrix $Z_p$ whose columns are powers of $x_i$.
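For example, for 1D inputs we can build $Z_p$ with columns $1, x, x^2, \dots, x^p$ and solve the same normal equations in the new basis; a minimal sketch (helper names and data are made up):

```python
import numpy as np

def poly_basis(x, p):
    """Change of basis: map 1D inputs x to columns [1, x, x^2, ..., x^p]."""
    return np.vander(x, p + 1, increasing=True)

# Toy 1D data (made up): a cubic trend plus noise.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.05 * rng.normal(size=100)

p = 3
Z = poly_basis(x, p)                  # (n, p+1) basis matrix
v = np.linalg.solve(Z.T @ Z, Z.T @ y) # same normal equations, in the new basis
y_hat = Z @ v                         # fitted values
```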
# Cost
The cost of solving a linear system of the form $Aw = b$:
- $O(nd)$ to form vector $b$
- $O(nd^2)$ to form matrix A
- Solving a $(d,d)$ system of equations is $O(d^3)$
Overall cost is $O(nd^2+d^3)$
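For a rough sense of scale (numbers made up for illustration): with $n = 10^6$ examples and $d = 10^2$ features,
$$nd^2 = 10^6 \cdot 10^4 = 10^{10} \quad \textrm{vs.} \quad d^3 = 10^6,$$
so forming $A = X^TX$ dominates; when $d$ is large relative to $n$, the $d^3$ solve can dominate instead.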
# Robust Regression
We minimize the L1-norm of the residuals instead of the squared L2-norm; this makes the fit less sensitive to outliers:
$$f(w) = \lVert Xw - y \rVert_1$$
However, since the L1-norm uses the absolute value function, it is non-differentiable at 0. We can instead minimize a smooth approximation of the L1-norm, such as the Huber loss applied to each residual $r_i = w^Tx_i - y_i$:
$$ h(r_i) = \begin{cases} \frac 1 2 r_i^2 & |r_i| \leq \epsilon \\ \epsilon (|r_i| - \frac 1 2 \epsilon) & \textrm{otherwise} \end{cases} $$
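A minimal sketch of the Huber loss and its derivative (the threshold name `eps` and the function names are mine); the derivative is what a gradient-based solver would use, since robust regression has no closed-form solution:

```python
import numpy as np

def huber(r, eps=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    small = np.abs(r) <= eps
    return np.where(small, 0.5 * r**2, eps * (np.abs(r) - 0.5 * eps))

def huber_grad(r, eps=1.0):
    """Derivative of the Huber loss with respect to the residual r."""
    return np.where(np.abs(r) <= eps, r, eps * np.sign(r))

# Residuals r_i = w^T x_i - y_i; large residuals get a linear (not quadratic) penalty.
r = np.array([-3.0, -0.2, 0.1, 5.0])
print(huber(r), huber_grad(r))
```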
Absolute error is more robust to outliers than squared error, and non-convex error functions are the most robust:
- They are generally not influenced even by groups of outliers
- But they are non-convex, so finding the global minimum is hard
# Brittle Regression
Sometimes we instead want to minimize the size of the worst error across examples, e.g. when the worst case is catastrophic (a plane crashes) or when we cannot afford to perform badly on any one group.
We can do this by minimizing the $L_\infty$-norm of the residuals, which is convex but non-smooth; this minimizes the largest error (effectively minimax regret in DUI).
The smooth approximation to the max function is the log-sum-exp function:
$$\max_i \{ z_i \} \approx \log\left( \sum_i \exp(z_i) \right)$$
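In practice log-sum-exp is computed after subtracting the maximum to avoid overflow; a minimal sketch:

```python
import numpy as np

def log_sum_exp(z):
    """Smooth approximation of max(z): log(sum(exp(z))), computed stably."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([1.0, 3.0, 2.5])
print(np.max(z), log_sum_exp(z))  # log-sum-exp is slightly above the true max
```

For the brittle-regression objective we would apply this to both $r_i$ and $-r_i$, since $|r_i| = \max\{r_i, -r_i\}$.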
# Penalizing Model Complexity
To choose the polynomial degree, minimize $\textrm{score}(p) = \frac 1 2 \lVert Z_p v - y \rVert^2 + p$, where $Z_p$ is the degree-$p$ basis matrix, $v$ are the fitted weights, and the $+p$ term penalizes higher degrees.
Other criteria replace the $p$ term with $\lambda k$, where $k$ is the estimated degrees of freedom (for degree-$p$ polynomials, $k = p + 1$) and $\lambda$ controls how strongly we penalize complexity.
$\lambda = 1$ is called the Akaike information criterion (AIC)
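A sketch of selecting the degree with such a score, using $k = p + 1$ and a tunable $\lambda$ (the helper name `poly_score` and the toy data are mine):

```python
import numpy as np

def poly_score(x, y, p, lam=1.0):
    """Training error plus complexity penalty lambda * k, with k = p + 1."""
    Z = np.vander(x, p + 1, increasing=True)   # degree-p polynomial basis
    v = np.linalg.solve(Z.T @ Z, Z.T @ y)      # least-squares fit in that basis
    err = 0.5 * np.sum((Z @ v - y) ** 2)
    return err + lam * (p + 1)

# Toy data (made up): quadratic trend plus noise.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=50)
y = 1.0 + x - 2.0 * x**2 + 0.1 * rng.normal(size=50)

scores = {p: poly_score(x, y, p) for p in range(0, 8)}
best_p = min(scores, key=scores.get)
print(best_p)  # should prefer a low degree (the true trend here is quadratic)
```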
See also: regularization