Raj - 2024 : Lecture 4

Lecture 4

Goal: Find parameters $W$ such that the loss function is minimized.

Gradient Descent

Many loss functions have no closed-form minimizer; therefore, setting the gradient (w.r.t. the model parameters) to zero and checking the Hessian is not always viable.

The algorithm

  1. Initialize $W^{0}$, $k = 0$.

  2. While $|f(W^{k+1}) - f(W^{k})| > \epsilon$:

    a. $W^{k+1} = W^{k} - \eta^k \cdot \nabla_{W} f(W^{k})^{T}$, then increment $k$.

where $\eta^k$ is the step size at the $k$-th iteration, and $\epsilon$ defines the termination criterion (stop once the change in the loss is $\leq \epsilon$). The intuition is that we check whether increasing each model parameter would increase or decrease the loss, and move each parameter in the direction that decreases it.
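As a concrete illustration, the sketch below implements the update rule above in NumPy on a toy quadratic; the function names, the constant step size, and the iteration cap are assumptions for the example, not part of the lecture.

```python
import numpy as np

def gradient_descent(f, grad_f, W0, eta=0.1, eps=1e-6, max_iter=10_000):
    """Minimize f via W^{k+1} = W^k - eta * grad f(W^k), stopping once
    |f(W^{k+1}) - f(W^k)| <= eps (max_iter is only a safety cap)."""
    W = W0
    prev = f(W)
    for _ in range(max_iter):
        W = W - eta * grad_f(W)      # the update rule from step 2a
        curr = f(W)
        if abs(curr - prev) <= eps:  # termination criterion
            break
        prev = curr
    return W

# Toy example: f(W) = ||W - 3||^2 has its minimum at W = (3, 3).
f = lambda W: float(np.sum((W - 3.0) ** 2))
grad_f = lambda W: 2.0 * (W - 3.0)
print(gradient_descent(f, grad_f, W0=np.zeros(2)))  # approximately [3. 3.]
```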

Remark. For a neural network, when computing the gradient, one can think of all parameters across all layers as elements of a single, large "vector of parameters".
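For example (the two layers and their shapes below are arbitrary, purely illustrative):

```python
import numpy as np

# Two hypothetical layers with weight matrices of different shapes.
W1 = np.random.randn(3, 4)
W2 = np.random.randn(4, 2)

# View all parameters as one long "vector of parameters"...
W_vec = np.concatenate([W1.ravel(), W2.ravel()])    # shape (20,)

# ...and reshape back into per-layer matrices when evaluating the network.
W1_back = W_vec[:W1.size].reshape(W1.shape)
W2_back = W_vec[W1.size:].reshape(W2.shape)
```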

Back to the learning problem

  1. The training set $\{(X_i, d_i)\}_{i=1}^N$, where $d_i = g(X_i)$.

  2. Minimize the loss: $L(W) = \frac{1}{N} \sum_{i} div(f(X_i; W), d_i)$. Note that since the training set is fixed, $W$ is the only set of variables here.

  3. Do gradient descent w.r.t. $W$ (a minimal sketch follows this list).
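A minimal end-to-end sketch, assuming a linear model for $f(\cdot; W)$ and squared error for $div$ (both choices are illustrative, not prescribed by the lecture):

```python
import numpy as np

# Hypothetical setup: f(X; W) = X @ W is a linear model and div is the
# squared error, so L(W) is the mean squared error over the training set.
rng = np.random.default_rng(0)
N, n, m = 100, 5, 1
X = rng.normal(size=(N, n))
W_true = rng.normal(size=(n, m))
d = X @ W_true                        # training targets d_i = g(X_i)

def L(W):
    # L(W) = (1/N) * sum_i div(f(X_i; W), d_i) with squared-error div
    return float(np.mean(np.sum((X @ W - d) ** 2, axis=1)))

def grad_L(W):
    # Gradient of L w.r.t. W: (2/N) * X^T (X W - d)
    return (2.0 / N) * X.T @ (X @ W - d)

W = np.zeros((n, m))
eta, eps = 0.1, 1e-10
prev = L(W)
for _ in range(10_000):               # safety cap on iterations
    W = W - eta * grad_L(W)           # gradient-descent step
    curr = L(W)
    if abs(curr - prev) <= eps:       # termination criterion
        break
    prev = curr
```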

The following are to be defined (see the previous lecture for each); a short sketch of items 1a and 3a follows the list:

  1. The function $f(\cdot; \cdot)$ is a neural network.

    a. Activation functions must be differentiable.

  2. The training set consists of samples of the ground-truth target function g()g(\cdot).

    a. E.g., $X_i \in \mathbb{R}^n$, $g(X_i) = d_i \in \mathbb{R}^m$.

  3. The divergence function $div(\cdot; \cdot)$

    a. Must be differentiable.

    b. A list of candidate functions: Link
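As an illustration of items 1a and 3a, here is one differentiable activation and one differentiable divergence together with their derivatives; the specific choices (sigmoid, squared error) are examples, not the lecture's prescription.

```python
import numpy as np

def sigmoid(z):
    # Differentiable activation: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Its derivative: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def squared_error(y, d):
    # Differentiable divergence: div(y, d) = ||y - d||^2
    return float(np.sum((y - d) ** 2))

def squared_error_grad(y, d):
    # Gradient w.r.t. the network output y: 2 * (y - d)
    return 2.0 * (y - d)
```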

Notes

  1. Gradient and level set: Link.

  2. Hessian and Eigenvalues: Link

  3. A list of activation functions: Link