# Neural networks

Last updated Dec 3, 2022

## Shallow Networks

Many domains require non-linear transforms of the features (see: change of basis), but it is usually not obvious which transform to use.

Neural network models try to learn good transformations. Whereas latent-factor models train the embedding and model separately, neural networks learn both features and the model at the same time.

Let $k$ be the number of hidden units and $h$ an element-wise non-linear activation. Generally, $\hat y_i = v^Th(Wx_i)$ (or, with bias, $\hat y_i = \sum_{c=1}^k v_ch(w_c^Tx_i + \beta_c) + \beta$).

Artificial neural network:

• $x_i$ is a measurement of the world
• $z_i = Wx_i$ is an internal representation of the world
• Each $h(z_i)$ can be viewed as a binary feature: do we care about it or not?
• Use the sigmoid as a smooth approximation of this binary decision
• $\hat y_i$ is the output of the neuron for classification/regression

Parameters: the $(k, d)$ matrix $W$ and the length-$k$ vector $v$. To turn this into multi-class classification, we replace $v$ with a $(k', k)$ matrix (where $k'$ is the number of classes) and convert to probabilities by computing the softmax of the $\hat y_c$ values.
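
As a minimal sketch of the multi-class forward pass described above (without bias terms), assuming NumPy and a sigmoid activation for $h$. The names `shallow_forward`, `V`, and `k_prime` are hypothetical and chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    # Subtract the row-wise max for numerical stability; the result is unchanged.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def shallow_forward(X, W, V):
    """X: (n, d) inputs, W: (k, d), V: (k_prime, k). Returns (n, k_prime) probabilities."""
    Z = X @ W.T           # (n, k) internal representation z_i = W x_i
    H = sigmoid(Z)        # (n, k) hidden activations h(z_i)
    scores = H @ V.T      # (n, k_prime) one score per class
    return softmax(scores)

# Random data and parameters, purely for illustration.
rng = np.random.default_rng(0)
n, d, k, k_prime = 5, 4, 3, 2
X = rng.normal(size=(n, d))
W = rng.normal(size=(k, d))
V = rng.normal(size=(k_prime, k))
probs = shallow_forward(X, W, V)
```

Each row of `probs` is a probability distribution over the `k_prime` classes for one example.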

Losses:

### Training

The training objective is generally non-convex because $W$ and $v$ are both variables. As such, finding the global optimum is NP-hard in general. We can use gradient descent, but it is not guaranteed to reach a global optimum due to the non-convexity.
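
A rough sketch of gradient descent on both $W$ and $v$ at once, assuming a regression setting with the squared loss $\frac{1}{2n}\sum_i (v^Th(Wx_i) - y_i)^2$ and a sigmoid $h$. The loss, the synthetic data, the step size, and the name `loss_and_grads` are all assumptions made for this example, not something the notes specify:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(X, y, W, v):
    """Mean squared loss of v^T h(W x_i) and its gradients w.r.t. W and v."""
    n = X.shape[0]
    H = sigmoid(X @ W.T)                      # (n, k) hidden activations
    r = H @ v - y                             # (n,) residuals
    loss = 0.5 * np.mean(r ** 2)
    grad_v = H.T @ r / n                      # (k,)
    grad_Z = np.outer(r, v) * H * (1.0 - H)   # chain rule through the sigmoid
    grad_W = grad_Z.T @ X / n                 # (k, d)
    return loss, grad_W, grad_v

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
n, d, k = 200, 2, 8
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
W = 0.5 * rng.normal(size=(k, d))
v = 0.5 * rng.normal(size=k)

step_size = 0.05
losses = []
for _ in range(1000):
    loss, grad_W, grad_v = loss_and_grads(X, y, W, v)
    losses.append(loss)
    W -= step_size * grad_W   # both parameter sets are updated together
    v -= step_size * grad_v
```

The loss goes down over the run, but nothing here certifies that the point reached is a global minimum.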

### Implicit Regularization

Often, increasing $k$, the number of hidden units, improves test error. This seems at odds with the fundamental tradeoff, doesn’t it?

However, learning theory (trade-off) results analyze the global minimum with the worst test error. The actual test error at a given global minimum can be better than this worst-case bound. Among the global minima, SGD is somehow converging to “good” ones! Empirically, using SGD is like using L2-regularization, but the regularization is “implicit”.

With small models, “minimize training error” leads to unique (or similar) global mins. With larger models, there is a lot of flexibility in the space of global mins (gap between best/worst).
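
The neural-network claim above is empirical, but a simpler linear analogy can be checked directly. The sketch below is an assumption-laden illustration, not the neural network case: for under-determined least squares (more features than examples) there are many global minima, and plain gradient descent started from $w = 0$ converges to the one with the smallest L2 norm. The data and variable names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # fewer examples than features: many global minima
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Gradient descent on f(w) = 0.5 * ||Xw - y||^2, started at w = 0.
step_size = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
w = np.zeros(d)
for _ in range(5000):
    w -= step_size * X.T @ (X @ w - y)

# The minimum-norm solution, computed directly with the pseudo-inverse.
w_min_norm = np.linalg.pinv(X) @ y

# Another global minimum: add a direction that X maps to zero.
z = rng.normal(size=d)
null_part = z - np.linalg.pinv(X) @ (X @ z)
w_other = w_min_norm + null_part
```

Both `w` and `w_other` fit the training data exactly, but gradient descent lands on the smaller-norm one without any explicit regularizer in the objective.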

We get results that look like the following:

## Deep Learning

Instead of a single layer of hidden units, we can stack them.

$$\begin{aligned} \hat y_i &= v^Th(W^{(m)}h(W^{(m-1)}\cdots h(W^{(1)}x_i))) \\ &= v^T\left(I_{l=1}^m\, h(W^{(l)}\,\cdot\,)\right)(x_i) && \textrm{where } I \textrm{ is repeated function composition} \end{aligned}$$
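
The repeated composition can be sketched as a loop over the layers. This is a minimal illustration assuming NumPy, a sigmoid $h$, and no bias terms; `deep_forward` and the layer sizes are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_forward(x, weights, v, h=sigmoid):
    """Compute v^T h(W^(m) h(... h(W^(1) x))) for weights = [W^(1), ..., W^(m)]."""
    a = x
    for W in weights:     # apply layers from the input side outward
        a = h(W @ a)
    return v @ a

# Three stacked layers mapping 4 -> 5 -> 3 -> 2 units (sizes are arbitrary).
rng = np.random.default_rng(0)
x = rng.normal(size=4)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5)), rng.normal(size=(2, 3))]
v = rng.normal(size=2)
y_hat = deep_forward(x, weights, v)
```

With a single matrix in `weights`, this reduces to the shallow model $v^Th(Wx_i)$.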

The gradient of the sigmoid function is nearly zero away from the origin. This gets worse when you take the sigmoid of a sigmoid of a sigmoid…, because the chain rule multiplies these small derivatives together.

If these gradients are numerically set to 0 because of how small they are, gradient descent will not make progress.

This is partially solved by replacing the sigmoid activation with the ReLU activation. Alternatively, we can use skip connections that ‘shortcut’ between layers.
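
To make the effect concrete, here is a deliberately simplified sketch: a scalar chain of activations with no weights, which is an assumption of this example rather than a real network. The sigmoid derivative is at most $1/4$, so the chain-rule product over $m$ layers is at most $(1/4)^m$; ReLU has derivative 1 for positive inputs, and a skip connection $a + h(a)$ has derivative $1 + h'(a) \geq 1$. The function names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # never exceeds 0.25

def relu(z):
    return max(z, 0.0)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chain_gradient(x, depth, activation, derivative, skip=False):
    """Derivative w.r.t. x of `depth` stacked activations, via the chain rule."""
    grad, a = 1.0, x
    for _ in range(depth):
        local = derivative(a)
        if skip:
            grad *= 1.0 + local   # layer is a + h(a), so its derivative is 1 + h'(a)
            a = a + activation(a)
        else:
            grad *= local         # layer is h(a)
            a = activation(a)
    return grad

for depth in (1, 5, 10, 20):
    print(depth,
          chain_gradient(2.0, depth, sigmoid, sigmoid_grad),
          chain_gradient(2.0, depth, relu, relu_grad),
          chain_gradient(2.0, depth, sigmoid, sigmoid_grad, skip=True))
```

The sigmoid column shrinks geometrically with depth, while the ReLU and skip-connection columns do not shrink in this toy setting.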

### Philosophy of Deep Learning

• No universally accepted explanation as to why they work so well, just really a form of classification
• The “Golden age network” had 3 main properties
1. shallow → no more than three or four layers between input and output
2. uniform → only one type of node deploying a sigmoidal activation
3. fully connected → each node in a lower layer is connected to every node in the next layer
• Depth, hierarchy of parts intuition
• Analogy of assembly line mass production of automobiles
• One person is skeptical of the significance of assembly lines → “anything that can be made by the assembly line could, in theory, be made by a team of skilled machinists”
• Other person believes that the assembly line is more efficient, specialized, and reusable
• Each unit can grow increasingly specialized and better at a small range of simpler tasks reliably and efficiently
• Standardization of units across automobiles
• sum-product network example
• simple device for computing polynomial functions
• shallow networks → must compute the expanded expressions of that function (skilled but inefficient machinists)
• deep networks → can compute the factorized expression of the polynomial function
• show that they can compose simple operations
• heterogeneity
• different types of operations composed together
• DCNNs → conv layer followed by ReLU followed by max pooling
• good at detecting features in a variety of different locations/poses
• combining all three operations means we can produce a simplified, transformed representation of the source image
• can get more complex/abstract as you move deeper through the layers
• sparse connectivity
• heuristic → only local pixels matter
• dramatically reduces number of learned parameters
• regularization
• input perturbations
• rotations/scaling/transformations
• noise
• dropout
• L1 regularization → favours simpler/sparser solutions by driving weights to exactly 0 unless the loss gradient is large enough to keep them non-zero
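
The conv → ReLU → max pooling pipeline and the parameter savings from sparse connectivity mentioned in the list above can be sketched in plain NumPy. This is an illustrative toy, not a real DCNN layer: the 8×8 image, the hand-written edge filter, and the function names are all assumptions made for the example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding); each output sees only a local patch."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Keep the largest value in each non-overlapping size-by-size block."""
    h = x.shape[0] // size * size
    w = x.shape[1] // size * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy image: dark left half, bright right half (a vertical edge).
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# Hand-written filter that responds where brightness increases left to right.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

features = max_pool(relu(conv2d_valid(image, kernel)))   # 8x8 -> 6x6 -> 3x3

# Shared local filter versus one weight per (input pixel, output unit) pair.
conv_params = kernel.size                   # 9 weights, reused at every location
dense_params = image.size * (6 * 6)         # 64 inputs * 36 outputs = 2304 weights
```

The same 9 filter weights detect the edge wherever it appears, and pooling keeps the response while shrinking the representation. In a trained network the filter would be learned rather than written by hand.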

#### So why are they so effective?

• hierarchical feature composition
• vector space separation
• input can be realized as a feature space
• output can be realized as manifolds or regions in the feature space
• training is just then learning the manifolds/regions that create desired categories
• most commentators agree that current deep learning methods fall short of implementing general intelligence, and it remains an open question as to whether some modification of current deep learning methods will be able to do so → question of intelligence
• self-learning algorithms like AlphaZero (which learns from self-play) seem to disprove/vindicate the empiricist approach (need real world experience to learn)
• counterargument is that systems like AlphaGo have built in knowledge about the rules of Go and mechanisms to explore possible outcomes one at a time (e.g. Monte Carlo Tree Search for the solution space)

#### Cognition and Intelligence

Potemkin village analogy for approximating intelligence.

#### Brain-like networks

• biological similarities
• CNNs have high sensitivity to spots, edges, and bars in specific orientations
• echoes the work of Hubel and Wiesel (1962), which found similar patterns in the feline visual cortex
• can record a single neuron but very difficult to record patterns
• functional vector → vector that corresponds to one of the output classes
• speech example, network managed to recover phonetic hierarchical information
• both systems have created a system of internal representations that corresponds to important distinctions and structures in the outside world
• theories → representations that allow networks to “make sense” of their corpus and respond in a fashion that reduces error
• how do we explain ‘conceptual change’?
• knowing a creature’s vector-space partitions may suffice for short-term prediction of behaviour but inadequate to predict or explain the evolution of those partitions over the course of time
• just knowing output space partitions is not enough, but connection weights seem to provide a level that meets all of these conditions
• neural networks have decently high fault tolerance (some redundant neurons)
• may help to explain functional persistence of brains in the face of minor damage
• in a large network, a loss of a few neurons will not make a huge impact, but the quality of its computations will progressively degrade

#### Differences

• real neural networks aren’t fully connected like ANNs
• real neural networks have horizontal cell-to-cell connections within a given layer which are not present in ANNs
• real brains don’t use backprop via generalized delta rule
• back prop requires
1. computing partial derivatives to minimize error
2. propagating deltas through the network back to relevant connections
• little empirical evidence for this in biological brains
• real brains show a progressive reduction in reaction time as one learns
• not seen in ANNs where error decreases but prediction time remains constant
• ANNs require a ‘global truth’ or teacher
• these ‘perfect’ signals are not present in the real world
• Hebbian learning
• those who ‘vote with winners, become winners’
• can be used to produce learning in ANNs but not nearly as effective as backprop
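
The contrast between a teacher-driven update and Hebbian learning in the list above can be sketched for a single linear neuron. This is a simplified illustration under assumed update rules (the basic Hebb rule $\Delta w = \eta\, y\, x$ and the delta rule $\Delta w = \eta\,(t - y)\,x$); the function names and numbers are hypothetical:

```python
import numpy as np

def hebbian_update(w, x, learning_rate=0.1):
    """Strengthen weights in proportion to input * output; no target is needed."""
    y = w @ x
    return w + learning_rate * y * x

def delta_rule_update(w, x, target, learning_rate=0.1):
    """Move the output toward a teacher-provided target (error-driven)."""
    y = w @ x
    return w + learning_rate * (target - y) * x

x = np.array([1.0, 0.5, -0.5])
w0 = np.array([0.2, 0.1, 0.0])
target = 1.0

# Delta rule: repeated updates shrink the error against the target.
w_delta = w0.copy()
for _ in range(100):
    w_delta = delta_rule_update(w_delta, x, target)

# Hebb rule: inputs that already drive the neuron get reinforced further.
w_hebb = w0.copy()
for _ in range(10):
    w_hebb = hebbian_update(w_hebb, x)
```

The delta rule needs `target`, the ‘global truth’ signal the notes say real brains lack, while the Hebbian update uses only locally available activity and simply amplifies the existing response.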