Effectively by constructing new features that take the variable to certain powers. To get a y-intercept (bias), we just raise $x$ to the 0th power to get 1. We can fit polynomials of degree $p$ by raising other powers:

$Z= 11⋮1 x_{1}x_{2}⋮x_{n} x_{1}x_{2}⋮x_{n} ……⋱… x_{1}x_{2}⋮x_{n} $As the polynomial degree increases, the training error goes down but the approximation error goes up.

Choosing a basis is hard! We can do something like Gaussian radial basis functions (RBFs) or polynomial basis as these are both universal approximators given enough data.

## Kernel Trick

Let $Z$ be the basis. With multi-dimensional polynomial bases, actually forming $Z$ which is $k=O(d_{p})$ is intractable.

Represent each column of $Z$ as a unique term. For example, with an $X$ of $d=2$, we can use $p=2$ to get

We compute $u=(K+λI)_{−1}y$

and for testing:

$y^ =Z~v=Z~Z_{T}(ZZ_{T}+λI)_{−1}y=K~(K+λI)_{−1}y=K~u minimum of L2-regularized least squares:v=Z_{T}(ZZ_{T}+λI)_{−1}yuis a (n,1) of kernel weights we learn $The key idea behind “kernel trick” for certain bases (like polynomials) is that we **can** efficiently compute $K$ and $K~$ even though forming $Z$ and $Z~$ is intractable.

We call $K=ZZ_{T}$ the Gram Matrix.

Finally, we call the general degree-p polynomial kernel function $K_{ij}=k(x_{i},x_{j})=(1+x_{i}x_{j})_{p}$. Computing $k$ is only $O(d)$ time instead of $O(d_{p})$.

Thus, computing $K$ is $O(n_{2}d+n_{3})$:

- Forming $K$ takes $O(n_{2}d)$ time
- Inverting $K+λI$ which is a $(n,n)$ takes $O(n_{3})$

All of our distance-based methods have kernel versions

## Learned Basis

We can also learn basis from data as well. See latent-factor model