Convolutional Neural Networks
Rather than picking from a fixed bank of convolutions, we learn the elements of the filters themselves. A convolution is a linear filtering operation that measures the effect one signal has on another.
If $x$ is the $(n,n)$ input signal (image) and $w$ is the $(2m+1, 2m+1)$ filter, then the 2D convolution is given by
$$z[i_1,i_2] = \sum_{j_1=-m}^m \sum_{j_2=-m}^m w[j_1,j_2]\, x[i_1+j_1,i_2+j_2]$$
(Strictly, the $+j$ indexing makes this cross-correlation rather than true convolution, but since the filter weights are learned the distinction does not matter.)
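As a sanity check, here is a minimal NumPy sketch of the formula above (the name `conv2d` is illustrative, not from any library; boundary handling is deferred to the Boundary Effects section, so the output is only computed where the filter fits entirely inside the image):

```python
import numpy as np

def conv2d(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Apply a (2m+1, 2m+1) filter w to an (n, n) image x, "valid" region only."""
    m = w.shape[0] // 2
    n = x.shape[0]
    z = np.zeros((n - 2 * m, n - 2 * m))
    for i1 in range(m, n - m):
        for i2 in range(m, n - m):
            # z[i1, i2] = sum over j1, j2 of w[j1, j2] * x[i1 + j1, i2 + j2],
            # with w stored so that array index 0 corresponds to j = -m
            z[i1 - m, i2 - m] = np.sum(w * x[i1 - m:i1 + m + 1, i2 - m:i2 + m + 1])
    return z
```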
# Convolutional Layer
The standard way to describe a layer's input volume is by its depth, width and height, $D \times W \times H$.
$K$ is the number of filters, $F$ is the spatial extent of the filters (kernel size), $S$ is the stride, and $P$ is the padding.
- $W_{out} = (W_{input} - F + 2P)/S + 1$
- $H_{out} = (H_{input} - F + 2P)/S + 1$
- $D_{out} = K$
Total number of learnable parameters: $(F \times F \times D_{input}) \times K + K$.
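A quick helper that applies these formulas (the name `conv_output_shape` is hypothetical; integer division assumes the stride divides evenly):

```python
def conv_output_shape(W_in, H_in, D_in, K, F, S=1, P=0):
    W_out = (W_in - F + 2 * P) // S + 1
    H_out = (H_in - F + 2 * P) // S + 1
    D_out = K
    params = (F * F * D_in) * K + K  # weights plus one bias per filter
    return (W_out, H_out, D_out), params

# e.g. 16 filters of size 3x3, stride 1, pad 1 on a 32x32x3 input:
print(conv_output_shape(32, 32, 3, K=16, F=3, S=1, P=1))
# ((32, 32, 16), 448)
```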
# Pooling Layer
Makes the representation smaller and more manageable, and gives some invariance to small spatial shifts.
- $W_{out} = (W_{input} - F)/S + 1$
- $H_{out} = (H_{input} - F)/S + 1$
- $D_{out} = D_{input}$
Total number of learnable parameters: 0.
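For concreteness, a minimal max-pooling sketch in NumPy (max pooling is the common choice; the function name and the $H \times W \times D$ layout here are illustrative):

```python
import numpy as np

def max_pool(x: np.ndarray, F: int, S: int) -> np.ndarray:
    """Max-pool an (H, W, D) volume; depth is untouched and nothing is learned."""
    H, W, D = x.shape
    H_out = (H - F) // S + 1
    W_out = (W - F) // S + 1
    out = np.zeros((H_out, W_out, D))
    for i in range(H_out):
        for j in range(W_out):
            # take the max over each FxF window, independently per channel
            out[i, j] = x[i*S:i*S+F, j*S:j*S+F].max(axis=(0, 1))
    return out

x = np.random.rand(32, 32, 16)
print(max_pool(x, F=2, S=2).shape)  # (16, 16, 16)
```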
# Layer Summary
- Convolutional Layer: applies a set of learnable filters
- Pooling Layer: performs spatial downsampling
- Fully-connected Layer: same as any regular neural network
A CNN then just learns a hierarchy of filters.
# Properties of Convolution
- Associative. $G \otimes (F \otimes I(x,y)) = (G \otimes F) \otimes I(x,y)$
- Commutative (symmetric). $G \otimes F = F \otimes G$, hence $(G \otimes F) \otimes I(x,y) = (F \otimes G) \otimes I(x,y)$
Correlation, on the other hand, is generally not associative.
For 1D Gaussians, we note $G_{\sigma_1}(x) \otimes G_{\sigma_2}(x) = G_{\sqrt{\sigma_1^2 + \sigma_2^2}}(x)$. In particular, convolving twice with the same Gaussian is equivalent to a single larger blur: $G_\sigma(x) \otimes G_\sigma(x) = G_{\sqrt 2 \sigma}(x)$.
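This identity is easy to verify numerically. A small sketch with sampled, normalized 1D Gaussians (the widths are arbitrary choices, and the match is exact only up to discretization error):

```python
import numpy as np

def gauss1d(sigma, width):
    x = np.arange(width) - width // 2
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()  # normalize so the weights sum to one

composed = np.convolve(gauss1d(1.0, 41), gauss1d(2.0, 41))  # G_1 (x) G_2, length 81
direct = gauss1d(np.sqrt(1.0**2 + 2.0**2), 81)              # G_sqrt(5), same support
print(np.abs(composed - direct).max())  # tiny: the identity holds up to discretization
```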
# Boundary Effects
- Ignore these locations: leave the output undefined for the outermost $m$ rows/columns
- Pad with zeroes: return zero whenever a value of $I$ is required at some position outside the image
- Assume periodicity: wrap image around
- Reflect border
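Three of these strategies correspond directly to modes of `numpy.pad`; "ignore these locations" simply means producing a smaller ("valid") output instead of padding at all:

```python
import numpy as np

img = np.arange(9).reshape(3, 3)
# one extra row/column on each side, under each boundary strategy:
print(np.pad(img, 1, mode="constant"))  # pad with zeroes
print(np.pad(img, 1, mode="wrap"))      # assume periodicity
print(np.pad(img, 1, mode="reflect"))   # reflect border
```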
# Pillbox
A 2D pillbox is rotationally invariant but not separable
$$f(x,y) = \frac{1}{\pi r^2} \begin{cases} 1 & \text{if } x^2 + y^2 \leq r^2 \\ 0 & \text{otherwise} \end{cases}$$
An efficient implementation would represent a 2D box filter as the sum of a 2D pillbox and some "extra corner bits"; the non-separable pillbox can then be computed as a cheap, separable box filter minus those corner contributions.
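A small sketch of a discrete pillbox, sampling the indicator on an integer grid and normalizing by the sum (rather than by the continuous $\frac{1}{\pi r^2}$):

```python
import numpy as np

def pillbox(r: int) -> np.ndarray:
    """Discrete pillbox: 1 inside the circle of radius r, normalized to sum to one."""
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    f = (x**2 + y**2 <= r**2).astype(float)
    return f / f.sum()

print(pillbox(2))
```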
# Gaussian Filters
- A box filter doesn't model lens defocus well; a circular pillbox is a much better model for defocus
- A Gaussian is a good general smoothing model:
    - for phenomena that are the sum of many small, independent effects
    - i.e., whenever the Central Limit Theorem (CLT) applies
Gaussian filters are rotationally invariant.
We get $G_\sigma(x,y) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{x^2+y^2}{2\sigma^2}\right)$ where $\sigma$ is the standard deviation.
For a discrete filter such as a 3x3, we then need to quantize and truncate the continuous function, evaluating $G_\sigma(x,y)$ at each integer position in the filter window. Increasing $\sigma$ means more blur. The problem with a 3x3 filter is that it truncates too much of the distribution: the weights do not sum to one, which causes unintentional darkening unless the filter is renormalized.
In general, the Gaussian filter should capture $\pm3\sigma$; for $\sigma = 1$ this gives a 7x7 filter.
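The truncation problem is easy to see numerically. A sketch that samples the formula above at integer offsets, with window sizes chosen per the $\pm3\sigma$ rule:

```python
import numpy as np

def gaussian_kernel(sigma: float, size: int) -> np.ndarray:
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    return np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

print(gaussian_kernel(1.0, 3).sum())  # ~0.78: too much mass cut off, image darkens
print(gaussian_kernel(1.0, 7).sum())  # ~0.999: +/-3 sigma captures nearly all of it
```

In practice the kernel is renormalized (`g /= g.sum()`) so the weights sum to one regardless of the window size.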
# Efficiency
As both the 2D box filter and the 2D Gaussian filter are separable, each can be implemented as two 1D convolutions, convolving each row and then each column separately.
A 2D filter is separable if it can be expressed as an outer product of two 1D filters.
A separable 2D Gaussian of width $m$ does only $2m$ multiplications at each pixel ($m$ for each 1D pass) instead of $m^2$. As the image has $n \times n$ pixels, this is $2m \cdot n^2$ multiplications; assuming $m \approx n$, this is $\mathcal{O}(n^3)$ rather than $\mathcal{O}(n^4)$.
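A sketch of the row-then-column trick in NumPy (illustrative names; `mode="same"` zero-pads at the borders):

```python
import numpy as np

def separable_blur(img: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Convolve every row with the 1D kernel g, then every column."""
    rows = np.apply_along_axis(np.convolve, 1, img, g, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, g, mode="same")

g = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
g /= g.sum()                # 1D binomial approximation of a Gaussian
print(np.outer(g, g))       # the equivalent 2D filter is exactly this outer product
print(separable_blur(np.random.rand(64, 64), g).shape)  # (64, 64)
```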
# Fourier Transform
The basic building block of the Fourier transform is the periodic function
$$A\sin(\omega x + \phi)$$
where $A$ is the amplitude, $\omega$ is the angular frequency and $\phi$ is the phase. Fourier’s claim was that you could add enough of these to get any periodic signal!
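Fourier's claim can be illustrated with partial sums for a square wave, whose series keeps only the odd harmonics (a standard result, sketched here in NumPy):

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 1000)
square = np.sign(np.sin(x))                      # target periodic signal
approx = np.zeros_like(x)
for k in range(1, 20, 2):                        # odd harmonics only
    approx += (4 / (np.pi * k)) * np.sin(k * x)  # A = 4/(pi k), omega = k, phi = 0
print(np.abs(approx - square).mean())  # shrinks as more terms are added
                                       # (Gibbs ringing remains at the jumps)
```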
# The Convolution Theorem
Let $i'(x,y) = f(x,y) \otimes i(x,y)$ be the convolution of filter $f$ with image $i$.
Then $\mathcal{I}'(\omega_x,\omega_y) = \mathcal{F}(\omega_x,\omega_y)\,\mathcal{I}(\omega_x,\omega_y)$, which is just a simple element-wise multiplication after applying a Fourier transform to each.
At the expense of two Fourier transforms and one inverse Fourier transform, convolution is reduced to (complex) element-wise multiplication. The FFT/IFFT costs $\mathcal{O}(n^2\log n)$ for the image and $\mathcal{O}(m^2\log m)$ for the filter, while the multiplication itself is only $\mathcal{O}(n^2)$, dropping the total cost of convolution to $\mathcal{O}(n^2\log n)$.
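A minimal FFT-convolution sketch in NumPy. Note that the DFT implements *circular* convolution, i.e. the "assume periodicity" boundary rule from above; the sizes here are arbitrary:

```python
import numpy as np

n, m = 256, 7
img = np.random.rand(n, n)
filt = np.random.rand(m, m)

pad = np.zeros((n, n))                # zero-pad the filter to the image size...
pad[:m, :m] = filt
pad = np.roll(pad, (-(m // 2), -(m // 2)), axis=(0, 1))  # ...centered at the origin

# multiply the spectra element-wise, then transform back
out = np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(pad)))
print(out.shape)  # (256, 256)
```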
# Convolution Sizing
Convolving two filters of size $m \times m$ and $n \times n$ results in a filter of size
$$(n + 2 \lfloor \frac m 2 \rfloor) \times (n + 2 \lfloor \frac m 2 \rfloor)$$
More broadly, for a set of $K$ filters of sizes $m_k \times m_k$, the resulting filter will have size
$$(m_1 + 2 \sum_{k=2}^K \lfloor \frac{m_k}{2} \rfloor) \times (m_1 + 2 \sum_{k=2}^K \lfloor \frac{m_k}{2} \rfloor)$$
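A quick check of this formula (the helper name is hypothetical):

```python
def composed_size(sizes):
    """Side length of the single filter equivalent to convolving the given filters."""
    return sizes[0] + 2 * sum(m // 2 for m in sizes[1:])

print(composed_size([3, 3]))     # 5: two 3x3 filters act like one 5x5
print(composed_size([3, 3, 3]))  # 7
```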