jzhao.xyz

Neural networks

Last updated Dec 3, 2022

See also: convolutional neural networks

# Shallow Networks

Many domains require non-linear transforms of the features (see: change of basis), but it is usually not obvious which transform to use.

Neural network models try to learn good transformations. Whereas latent-factor models train the embedding and model separately, neural networks learn both features and the model at the same time.

Let $k$ be the number of hidden units and $h$ a non-linear activation function applied element-wise. Generally, $\hat y_i = v^Th(Wx_i)$ (or, with bias, $\hat y_i = \sum_{c=1}^k v_ch(w_c^Tx_i + \beta_c) + \beta$).
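
As a rough illustration of the formula above, here is a minimal sketch of the forward pass without bias terms, in plain Python. The function names (`matvec`, `predict`), the choice of `tanh` as $h$, and the toy values of `W`, `v` and `x` are assumptions made for the example, not part of the notes.

```python
import math

def matvec(W, x):
    """Multiply a (k, d) matrix W (a list of rows) by a length-d vector x."""
    return [sum(w_cj * x_j for w_cj, x_j in zip(row, x)) for row in W]

def predict(W, v, x, h=math.tanh):
    """Compute y_hat = v^T h(W x) for a single example x."""
    z = matvec(W, x)              # k linear combinations of the features
    a = [h(z_c) for z_c in z]     # element-wise non-linearity
    return sum(v_c * a_c for v_c, a_c in zip(v, a))

# Hypothetical toy values: d = 2 features, k = 3 hidden units.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
v = [0.5, -0.5, 2.0]
y_hat = predict(W, v, [1.0, 2.0])
```

Passing `h=lambda t: t` collapses the model to a linear one, which is one way to see that the non-linearity is what makes the hidden layer useful.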

Artificial neural network:

Parameters: the $k \times d$ matrix $W$ and the length-$k$ vector $v$. For multi-class classification, we replace $v$ with a $k' \times k$ matrix (where $k'$ is the number of classes) and convert the $k'$ outputs $\hat y_c$ to probabilities by computing their softmax.
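
The multi-class variant might be sketched as follows. The name `V` for the $k' \times k$ matrix, the `tanh` activation, and the toy weights are assumptions for illustration; subtracting the maximum score before exponentiating is a common numerical-stability trick, not something the notes specify.

```python
import math

def softmax(scores):
    """Convert real-valued scores into probabilities that sum to 1."""
    m = max(scores)                            # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(W, V, x, h=math.tanh):
    """W is (k, d), V is (k', k); returns k' class probabilities."""
    a = [h(sum(w * x_j for w, x_j in zip(row, x))) for row in W]
    scores = [sum(v_c * a_c for v_c, a_c in zip(row, a)) for row in V]
    return softmax(scores)

# Hypothetical toy values: d = 2, k = 2, k' = 3 classes.
W = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
probs = predict_proba(W, V, [2.0, -1.0])
predicted_class = probs.index(max(probs))
```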

Losses:

# Training

The training objective is generally non-convex because $W$ and $v$ are optimized jointly. As a result, finding the global optimum is NP-hard in general. We can still use gradient descent, but non-convexity means it is not guaranteed to reach a global optimum.
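
To make the training procedure concrete, here is a minimal sketch of full-batch gradient descent on a shallow network with hand-derived gradients. It assumes a squared-error loss $\frac{1}{2}\sum_i (\hat y_i - y_i)^2$, a `tanh` activation, a fixed step size, and a made-up toy dataset; none of these choices come from the notes, and the point it reaches depends on the random initialization.

```python
import math
import random

def forward(W, v, x):
    """Return the hidden activations and the prediction v^T tanh(W x)."""
    a = [math.tanh(sum(w * x_j for w, x_j in zip(row, x))) for row in W]
    return a, sum(v_c * a_c for v_c, a_c in zip(v, a))

def loss_and_grads(W, v, X, y):
    """Squared-error loss and its gradients with respect to W and v."""
    k, d = len(W), len(W[0])
    gW = [[0.0] * d for _ in range(k)]
    gv = [0.0] * k
    loss = 0.0
    for x, y_i in zip(X, y):
        a, y_hat = forward(W, v, x)
        r = y_hat - y_i                           # residual
        loss += 0.5 * r * r
        for c in range(k):
            gv[c] += r * a[c]
            back = r * v[c] * (1.0 - a[c] ** 2)   # tanh'(z) = 1 - tanh(z)^2
            for j in range(d):
                gW[c][j] += back * x[j]
    return loss, gW, gv

def train(X, y, k=3, step=0.005, iters=1000, seed=0):
    """Plain gradient descent from a random start (assumed hyperparameters)."""
    rng = random.Random(seed)
    d = len(X[0])
    W = [[rng.gauss(0, 0.5) for _ in range(d)] for _ in range(k)]
    v = [rng.gauss(0, 0.5) for _ in range(k)]
    losses = []
    for _ in range(iters):
        loss, gW, gv = loss_and_grads(W, v, X, y)
        losses.append(loss)
        W = [[W[c][j] - step * gW[c][j] for j in range(d)] for c in range(k)]
        v = [v[c] - step * gv[c] for c in range(k)]
    return W, v, losses

# Toy data: one real feature plus a constant feature standing in for the bias.
X = [[t / 2.0, 1.0] for t in range(-4, 5)]
y = [abs(x[0]) for x in X]
W, v, losses = train(X, y)
```

Changing `seed` changes the starting point and, because the objective is non-convex, can change which stationary point the iterates approach.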

# Implicit Regularization

Often, increasing $k$, the number of hidden units, reduces test error. This seems at odds with the fundamental tradeoff, doesn’t it?

However, learning theory results on the trade-off analyze the global minimum with the worst test error. The actual test error at a given global minimum can be much better than this worst-case bound. Among the global minima, SGD somehow tends to converge to “good” ones! Empirically, using SGD behaves like using L2-regularization, but the regularization is “implicit”.

With small models, minimizing training error leads to a unique global minimum (or a set of similar ones). With larger models, there is a lot of flexibility in the space of global minima, so the gap between the best and the worst can be large.

We get results that look like the following:

# Deep Learning

Instead of a single layer of hidden units, we can stack them.

$$\begin{aligned} \hat y_i &= v^Th(W^{(m)}h(W^{(m-1)}h(\dots W^{(1)}x_i))) \\ &= v^T(I_{l=1}^mh(W^{(l)}x_i)) & \textrm{Where } I \textrm{ is repeated function composition} \end{aligned}$$
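
The repeated composition above amounts to a loop over the layer matrices. A minimal sketch, where the names `layer` and `deep_predict`, the default `tanh` activation, and the toy weights are assumptions for illustration:

```python
import math

def layer(W, x, h):
    """One hidden layer: h applied element-wise to W x."""
    return [h(sum(w * x_j for w, x_j in zip(row, x))) for row in W]

def deep_predict(Ws, v, x, h=math.tanh):
    """Ws = [W1, ..., Wm]; apply h(W^(l) .) for l = 1..m, then take v^T."""
    a = x
    for W in Ws:                  # repeated function composition
        a = layer(W, a, h)
    return sum(v_c * a_c for v_c, a_c in zip(v, a))

# Hypothetical toy values: two layers, each mapping 2 inputs to 2 hidden units.
Ws = [
    [[1.0, 1.0], [1.0, -1.0]],
    [[2.0, 0.0], [0.0, 3.0]],
]
v = [1.0, 1.0]
y_hat = deep_predict(Ws, v, [1.0, 2.0])
```

With a single matrix in `Ws` this reduces to the shallow model $v^Th(Wx_i)$.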

# Vanishing Gradient Problem

The gradient of the sigmoid function is nearly zero away from the origin. This gets worse when you take the sigmoid of a sigmoid of a sigmoid…

If these gradients underflow to 0 numerically because of how small they are, gradient descent will not make progress.

This is partially solved by replacing the sigmoid activation with the ReLU activation. Alternatively, we can use skip connections that ‘shortcut’ between layers.
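
A small numeric sketch of the effect, under simplifying assumptions: a scalar chain with unit weights, so the gradient through `depth` layers is just the product of the local derivatives. The skip-connection variant assumes an identity shortcut of the form $a + h(a)$, which is one common way to write it. All function names are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # never larger than 0.25

def relu(z):
    return max(0.0, z)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def chain_gradient(x, depth, h, h_prime):
    """d/dx of h(h(...h(x))) with `depth` nested applications."""
    a, grad = x, 1.0
    for _ in range(depth):
        grad *= h_prime(a)        # chain rule: multiply by the local derivative
        a = h(a)
    return grad

def skip_chain_gradient(x, depth, h, h_prime):
    """Same chain, but each layer computes a + h(a) (identity shortcut)."""
    a, grad = x, 1.0
    for _ in range(depth):
        grad *= 1.0 + h_prime(a)  # the shortcut contributes the constant 1
        a = a + h(a)
    return grad

for depth in (1, 5, 10, 20):
    print(depth,
          chain_gradient(2.0, depth, sigmoid, sigmoid_prime),
          chain_gradient(2.0, depth, relu, relu_prime),
          skip_chain_gradient(2.0, depth, sigmoid, sigmoid_prime))
```

Since each sigmoid derivative is at most 0.25, the plain sigmoid chain shrinks at least geometrically with depth, while the ReLU chain on a positive input and the shortcut chain do not shrink.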

# Philosophy of Deep Learning

# So why are they so effective?

Cognition and Intelligence

Potemkin village analogy for approximating intelligence.

# Brain-like networks

# Differences