# Change of Basis
We can fit nonlinear models with linear regression by constructing new features that raise the variable to certain powers. To get a y-intercept (bias), we raise $x$ to the 0th power to get 1. We can fit a polynomial of degree $p$ by including the higher powers as extra columns:
$$ Z = \begin{bmatrix} 1 & x_1 & x_1^2 & \dots & x_1^p \\ 1 & x_2 & x_2^2 & \dots & x_2^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \dots & x_n^p \end{bmatrix} $$
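To make the construction concrete, here is a minimal numpy sketch that forms $Z$ for a 1-D feature and fits it with ordinary least squares. The synthetic data, the chosen degree, and names like `poly_basis` are illustrative assumptions, not part of the notes.

```python
import numpy as np

def poly_basis(x, p):
    """Map a 1-D feature vector x of length n to the (n, p + 1) matrix Z
    with columns 1, x, x^2, ..., x^p."""
    return np.vander(x, N=p + 1, increasing=True)

# Hypothetical 1-D regression data.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(30)

p = 3
Z = poly_basis(x, p)                       # (n, p + 1) basis matrix
w = np.linalg.solve(Z.T @ Z, Z.T @ y)      # least-squares weights via the normal equations
y_hat = Z @ w                              # fitted values
```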
As the polynomial degree increases, the training error goes down but the approximation error goes up.
Choosing a basis is hard! We can use Gaussian radial basis functions (RBFs) or a polynomial basis, since both are universal approximators given enough data.
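For comparison with the polynomial basis above, here is a minimal sketch of Gaussian RBF features. Using the training points as centers and the particular bandwidth value are illustrative choices, not prescriptions from the notes.

```python
import numpy as np

def rbf_basis(x, centers, sigma):
    """Gaussian RBF features: Z[i, j] = exp(-(x_i - c_j)^2 / (2 sigma^2))."""
    diffs = x[:, None] - centers[None, :]          # (n, k) pairwise differences
    return np.exp(-(diffs ** 2) / (2.0 * sigma ** 2))

# Hypothetical 1-D data; centers are the training points themselves (a common choice)
# and sigma is a bandwidth hyperparameter one would tune by validation.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 30))
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(30)

Z = rbf_basis(x, centers=x, sigma=0.2)             # (n, n) basis matrix
w = np.linalg.solve(Z.T @ Z + 1e-8 * np.eye(Z.shape[1]), Z.T @ y)  # tiny ridge term for numerical stability
y_hat = Z @ w
```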
# Kernel Trick
Let $Z$ be the matrix of basis features. With multi-dimensional polynomial bases, explicitly forming $Z$ is intractable, since it has $k = O(d^p)$ columns.
Each column of $Z$ corresponds to a unique monomial term. For example, with an $X$ of $d = 2$ features and $p = 2$, row $i$ of $Z$ is
$$ z_i = \begin{bmatrix} 1 & x_{i1} & x_{i2} & x_{i1}^2 & x_{i1}x_{i2} & x_{i2}^2 \end{bmatrix} $$
For training, we compute $u = (K + \lambda I)^{-1}y$, where $K = ZZ^T$,
and for testing:
$$ \begin{aligned} \hat y &= \tilde Z v \\ &= \tilde Z Z^T (ZZ^T + \lambda I)^{-1} y & \textrm{minimum of L2-regularized least squares: } v = Z^T(ZZ^T + \lambda I)^{-1}y \\ &= \tilde K (K + \lambda I)^{-1} y & \textrm{with } \tilde K = \tilde Z Z^T \textrm{ and } K = ZZ^T \\ &= \tilde K u & u \textrm{ is the } (n,1) \textrm{ vector of kernel weights we learn} \end{aligned} $$
The key idea behind “kernel trick” for certain bases (like polynomials) is that we can efficiently compute $K$ and $\tilde K$ even though forming $Z$ and $\tilde Z$ is intractable.
We call $K = ZZ^T$ the Gram Matrix.
Finally, the general degree-$p$ polynomial kernel function is $K_{ij} = k(x_i, x_j) = (1 + x_i^Tx_j)^p$. Computing each entry $k(x_i, x_j)$ takes only $O(d)$ time instead of $O(d^p)$.
Thus, training takes $O(n^2d + n^3)$ time:
- Forming $K$ takes $O(n^2d)$ time
- Solving the $(n,n)$ linear system with $K+\lambda I$ takes $O(n^3)$
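Putting the training and testing steps together, here is a minimal numpy sketch of kernel ridge regression with the degree-$p$ polynomial kernel. The helper names (`poly_kernel`, `fit_kernel_ridge`, `predict_kernel_ridge`), the synthetic data, and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def poly_kernel(A, B, p):
    """Degree-p polynomial kernel: K[i, j] = (1 + a_i^T b_j)^p.
    Each entry costs O(d), avoiding the O(d^p) explicit basis."""
    return (1.0 + A @ B.T) ** p

def fit_kernel_ridge(X, y, p, lam):
    """Training: u = (K + lam * I)^{-1} y, with K = Z Z^T computed via the kernel."""
    n = X.shape[0]
    K = poly_kernel(X, X, p)                       # O(n^2 d)
    u = np.linalg.solve(K + lam * np.eye(n), y)    # O(n^3)
    return u

def predict_kernel_ridge(X_test, X_train, u, p):
    """Testing: y_hat = K_tilde @ u, where K_tilde[i, j] = k(x_tilde_i, x_j)."""
    K_tilde = poly_kernel(X_test, X_train, p)      # (t, n)
    return K_tilde @ u

# Hypothetical data with d = 5 features and a degree-3 kernel.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
u = fit_kernel_ridge(X, y, p=3, lam=1.0)
y_hat = predict_kernel_ridge(X, X, u, p=3)
```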
All of our distance-based methods have kernel versions.
# Learned Basis
We can also learn a basis from the data; see latent-factor models.