When we minimize or maximize a function, we call it optimization.

Gradient descent is essentially an iterative optimization algorithm that takes a guess and refines it using the gradient to make a better guess.

If the objective function is a convex function, then it will converge to a *global optimum*.

Gradient descent finds critical point of differentiable function. Which can be faster than normal equations for large ‘d’ values. It takes $O(nd)$ per iteration so $O(tnd)$ for $t$ iterations.

Formally,

- We start with an initial guess: $w_{0}$
- We repeatedly refine the guess: $w_{t+1}=w_{t}−α_{t}∇f(w_{t})$
- $α$ here is the learning rate.
- We move in the negative gradient direction as given some parameters $w$ the direction of largest decrease is $−∇f(w)$

- We stop when $∥∇f(w_{t})∥≤ϵ$

## Stochastic Gradient Descent (SGD)

However, the runtime of each iteration of regular gradient descent is proportional to $n$. This is problematic when we have large training sets!

Instead of computing the gradient over all training examples, we do it for some random training example $i$. Intuition is that *on average*, the algorithm will head in the right direction

We can use it when minimizing averages (so all regression losses except brittle regression)

When we get close enough to a local minima $w_{∗}$, we enter a region of confusion where some $∇f_{i}(w)$ point towards $w_{∗}$ and others don’t. This confusion region is captured by the variance of the gradients

- If the variance is 0, every step goes in the right direction (outside region of confusion)
- If the variance is small, most steps go in the right direction (just inside region of confusion)
- If the variance is large, many steps point in the wrong direction (middle of region of confusion)

Basically, for a fixed stepsize, SGD makes progress until the variance is too large.

### Decreasing Step Size

If we decrease the step size as we keep training, we can still converge to a stationary point as long as:

$∑_{t=1}α_{t}∑_{t=1}(α_{t})_{2} =0$ A common option is to use $α_{t}=O(t 1 )$

### Minibatches

We can train on a ‘mini-batch’ $B_{t}$ of examples. Radius of region of confusion is inversely proportional to $B_{t}$

### Early Stopping

Normally, we stop GD when gradient is close to zero. However, we never know this when doing SGD (as we cannot see the full gradient). We just stop early if the validation set error is not improving (this also reduces overfitting)