Raj - 2024 : Lecture 4

Lecture 4

Goal: Find parameters $W$ such that the loss function is minimized.

Gradient Descent

Many loss functions have no closed-form minimizer; therefore, setting the gradient (w.r.t. the model parameters) to zero and checking the Hessian is not always viable.

The algorithm

  1. Initialize $W^{0}$, $k = 0$.

  2. While $|f(W^{k+1}) - f(W^{k})| > \epsilon$:

    a. $W^{k+1} = W^{k} - \eta^k \cdot \nabla_{W} f(W^{k})^{T}$, then increment $k$.

where $\eta^k$ is the step size at the $k$-th iteration, and $\epsilon$ defines the termination criterion (stop once the change in the loss is $\leq \epsilon$). The intuition is that we check whether increasing each model parameter would increase or decrease the loss, and move each parameter in the direction that decreases it.
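As a concrete illustration, the sketch below implements the update rule above in NumPy on a toy quadratic; the function names, the constant step size, and the iteration cap are assumptions for the example, not part of the lecture.

```python
import numpy as np

def gradient_descent(f, grad_f, W0, eta=0.1, eps=1e-6, max_iter=10_000):
    """Minimize f via W^{k+1} = W^k - eta * grad f(W^k), stopping once
    |f(W^{k+1}) - f(W^k)| <= eps (max_iter is only a safety cap)."""
    W = W0
    prev = f(W)
    for _ in range(max_iter):
        W = W - eta * grad_f(W)      # the update rule from step 2a
        curr = f(W)
        if abs(curr - prev) <= eps:  # termination criterion
            break
        prev = curr
    return W

# Toy example: f(W) = ||W - 3||^2 has its minimum at W = (3, 3).
f = lambda W: float(np.sum((W - 3.0) ** 2))
grad_f = lambda W: 2.0 * (W - 3.0)
print(gradient_descent(f, grad_f, W0=np.zeros(2)))  # approximately [3. 3.]
```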

Remark. For a neural network, when computing the gradient, one can think of all parameters across all layers as elements of a single, large "vector of parameters".
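For example (the two layers and their shapes below are arbitrary, purely illustrative):

```python
import numpy as np

# Two hypothetical layers with weight matrices of different shapes.
W1 = np.random.randn(3, 4)
W2 = np.random.randn(4, 2)

# View all parameters as one long "vector of parameters"...
W_vec = np.concatenate([W1.ravel(), W2.ravel()])    # shape (20,)

# ...and reshape back into per-layer matrices when evaluating the network.
W1_back = W_vec[:W1.size].reshape(W1.shape)
W2_back = W_vec[W1.size:].reshape(W2.shape)
```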

Back to the learning problem

  1. The training set $\{(X_i, d_i)\}_{i=1}^N$, where $d_i = g(X_i)$.

  2. Minimize the loss: $L(W) = \frac{1}{N} \sum_{i} div(f(X_i; W), d_i)$. Note that since the training set is fixed, $W$ is the only set of variables here.

  3. Do gradient descent w.r.t. $W$ (a minimal sketch follows this list).
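A minimal end-to-end sketch, assuming a linear model for $f(\cdot; W)$ and squared error for $div$ (both choices are illustrative, not prescribed by the lecture):

```python
import numpy as np

# Hypothetical setup: f(X; W) = X @ W is a linear model and div is the
# squared error, so L(W) is the mean squared error over the training set.
rng = np.random.default_rng(0)
N, n, m = 100, 5, 1
X = rng.normal(size=(N, n))
W_true = rng.normal(size=(n, m))
d = X @ W_true                        # training targets d_i = g(X_i)

def L(W):
    # L(W) = (1/N) * sum_i div(f(X_i; W), d_i) with squared-error div
    return float(np.mean(np.sum((X @ W - d) ** 2, axis=1)))

def grad_L(W):
    # Gradient of L w.r.t. W: (2/N) * X^T (X W - d)
    return (2.0 / N) * X.T @ (X @ W - d)

W = np.zeros((n, m))
eta, eps = 0.1, 1e-10
prev = L(W)
for _ in range(10_000):               # safety cap on iterations
    W = W - eta * grad_L(W)           # gradient-descent step
    curr = L(W)
    if abs(curr - prev) <= eps:       # termination criterion
        break
    prev = curr
```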

The following are to be defined (see the previous lecture for each); a short sketch of items 1a and 3a follows the list:

  1. The function $f(\cdot; \cdot)$ is a neural network.

    a. Activation functions must be differentiable.

  2. The training set consists of samples of the ground-truth target function g()g(\cdot).

    a. E.g., $X_i \in \mathbb{R}^n$, $g(X_i) = d_i \in \mathbb{R}^m$.

  3. The divergence function $div(\cdot; \cdot)$

    a. Must be differentiable.

    b. A list of candidate functions: Link
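As an illustration of items 1a and 3a, here is one differentiable activation and one differentiable divergence together with their derivatives; the specific choices (sigmoid, squared error) are examples, not the lecture's prescription.

```python
import numpy as np

def sigmoid(z):
    # Differentiable activation: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Its derivative: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def squared_error(y, d):
    # Differentiable divergence: div(y, d) = ||y - d||^2
    return float(np.sum((y - d) ** 2))

def squared_error_grad(y, d):
    # Gradient w.r.t. the network output y: 2 * (y - d)
    return 2.0 * (y - d)
```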

Notes

  1. Gradient and level set: Link.

  2. Hessian and Eigenvalues: Link

  3. A list of activation functions: Link