Raj - 2024 : Lecture 3

Lecture 3

A key assumption for this lecture: the NN has a sufficient architecture (e.g., it is sufficiently deep and wide) to model the desired function, unless stated otherwise.

Learning unknown functions

Neural networks: $f(X; W)$, where $X$ is the input (before the semicolon) and $W$ are the parameters.

  • The form of the target function $g(X)$ is unknown, so we need to sample it.

  • Based on the samples, learn a set of parameters $W$ to approximate the true function $g(X)$ by minimizing the empirical error (however the error is defined).

    • E.g., Binary classification: minimize the count of misclassifications.
  • The hope is that the learned $f(X; W)$ will fit the target function $g(X)$ everywhere, even though it has only seen the samples.

  • An example of learning a decision boundary for binary classification: Link.

    • Learn $W$ such that for each input $X$, $W \cdot X \geq 0$ iff $X$ is a positive example (this may not be possible, since the data might not be separable by a hyperplane). A concrete sketch of this setup follows below.
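
To make the setup concrete, here is a minimal NumPy sketch; the target function `g`, the sample size, and the candidate weights `W` are all made up for illustration. We can only query the otherwise unknown `g(X)` at sampled inputs, and we score a linear rule $W \cdot X \geq 0$ by its empirical misclassification rate.

```python
import numpy as np

# Hypothetical target function g: labels a point +1 if it lies above the
# line x2 = 0.5 * x1, and -1 otherwise. In practice g is unknown and we
# only observe its values at sampled inputs.
def g(X):
    return np.where(X[:, 1] >= 0.5 * X[:, 0], 1, -1)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))   # sampled inputs
y = g(X)                                # sampled labels of the target

# A candidate linear rule f(X; W): predict +1 iff W . X >= 0.
W = np.array([-0.5, 1.0])
y_hat = np.where(X @ W >= 0, 1, -1)

# Empirical error: fraction of misclassified samples.
print("empirical misclassification rate:", np.mean(y_hat != y))
```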

A single-perceptron learning algorithm

Note: (i) only a single neuron with a threshold-$0$ activation function in the hidden layer; (ii) this algorithm is for binary classification only.

  1. Randomly initialize $W$.

  2. For each training instance $(X_i, y_i)$, where $y_i \in \{-1, +1\}$, $i = 1, \dots, N$:

    a. Let $\hat{y}_i = \text{sign}(W \cdot X_i)$.

    b. If $\hat{y}_i \neq y_i$, then update $W = W + y_i X_i$.

  3. Repeat step 2 until there are zero misclassifications (a code sketch of the full procedure follows below).
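
A minimal sketch of the algorithm above, assuming the data are NumPy arrays; the `max_epochs` cap is an added safeguard (not part of the algorithm as stated) so the loop also terminates when the classes are not linearly separable.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000, seed=0):
    """Single-perceptron learning for labels y in {-1, +1}.

    X: (N, d) array of inputs; y: (N,) array of labels.
    Returns W with sign(W . X_i) == y_i for all i if the loop converges.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=X.shape[1])           # step 1: random initialization
    for _ in range(max_epochs):               # cap added for the non-separable case
        mistakes = 0
        for Xi, yi in zip(X, y):              # step 2: sweep over training instances
            y_hat = 1 if W @ Xi >= 0 else -1  # 2a: y_hat = sign(W . Xi)
            if y_hat != yi:                   # 2b: update only on a misclassification
                W = W + yi * Xi
                mistakes += 1
        if mistakes == 0:                     # step 3: stop at zero misclassifications
            return W
    return W                                  # may not have converged
```

For linearly separable data the returned $W$ classifies every training point correctly; otherwise the loop simply runs out of epochs, which is the non-convergence noted in the remarks below.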

Remarks.

  1. The algorithm converges if and only if the classes are linearly separable. Otherwise, it never converges.

  2. This is an example of the limitations of an insufficient architecture: in this case, an NN whose only hidden layer contains a single neuron.

  3. An alternative way to look at this problem:

    a. Reflect each datapoint $X_i$ by $y_i$; that is, we obtain a new point $X_i' = y_i X_i$.

    b. Find a hyperplane (where $W$ is the normal vector) such that all reflected points lie on the same side of the plane.

    c. If such a hyperplane exists, then a simple candidate solution is setting $W = \frac{1}{N} \sum_i X_i'$ (this is guaranteed to work when the reflected points have pairwise non-negative inner products, though not in general); a quick numerical check follows below.
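
A quick numerical check of this view on a made-up toy dataset (the data and the averaging step are only illustrative; as noted above, the average is not guaranteed to work in general):

```python
import numpy as np

# Toy linearly separable data: label +1 above the line x2 = x1, else -1.
X = np.array([[0.0, 1.0], [1.0, 2.0], [1.0, 0.0], [2.0, 1.0]])
y = np.array([1, 1, -1, -1])

# (a) Reflect each datapoint by its label.
X_reflected = y[:, None] * X

# (c) Candidate solution: the mean of the reflected points.
W = X_reflected.mean(axis=0)

# (b) Check that every reflected point lies on the non-negative side of the
#     hyperplane with normal W, i.e. W . X_i' >= 0 (equivalently y_i (W . X_i) >= 0).
print(X_reflected @ W >= 0)   # [ True  True  True  True] for this toy example
```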

When classes are not linearly separable

One needs multiple perceptrons, possibly one perceptron for each decision boundary. However, note that the single-perceptron algorithm stated above cannot be naively applied to each perceptron here, unless one tries every possible relabelling of the data for each perceptron. Learning is hard.

The issues are introduced by a combination of (i) threshold activation functions (whose gradients are either $0$ or infinite), and (ii) the way the classification error is computed (a discrete, non-differentiable count). Therefore, algorithms based on incremental improvement are unlikely to work here, because a small change in the weights often does not change the resulting loss unless a threshold condition is crossed.

Source: The class lecture note (Link)
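
The flatness of the loss can be seen numerically. In the sketch below (the data, the weights, and the perturbation sizes are all made up), the misclassification count typically does not change at all under small perturbations of $W$, so incremental updates get no local signal:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] >= 0, 1, -1)      # a made-up binary problem

def zero_one_loss(W):
    # Number of misclassifications of the thresholded unit sign(W . X).
    y_hat = np.where(X @ W >= 0, 1, -1)
    return int(np.sum(y_hat != y))

W = np.array([0.3, -0.2])                        # an arbitrary (poor) weight vector
for eps in [1e-4, 1e-3, 1e-2]:
    # The count is typically identical for small perturbations: the loss
    # surface is piecewise constant, so there is no useful gradient.
    print(eps, zero_one_loss(W), zero_one_loss(W + eps))
```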

Both the activation function and the error function should be differentiable w.r.t. $W$. Note that whatever error function we use, it should be a good proxy for the classification error.

Let $\text{div}(f(W; X), g(X))$ be the divergence (loss). One can then find $W$ to minimize the empirical loss:

$$\frac{1}{N} \sum_{i} \text{div}(f(W; X_i), g(X_i))$$

which estimates the true loss $\mathbb{E}\left[\text{div}(f(W; X), g(X))\right]$, where $X$ is the random variable.
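
As a sketch of this setup, assuming a squared-error divergence, a linear model, and synthetic data (none of which are prescribed by the notes): the empirical loss is the sample average of the divergence, and a larger sample gives a Monte-Carlo estimate of the expected (true) loss.

```python
import numpy as np

rng = np.random.default_rng(2)

def g(X):
    # The target function (known here only so that we can simulate samples).
    return np.sin(X[:, 0]) + 0.5 * X[:, 1]

def f(X, W):
    # A simple parametric model standing in for f(W; X).
    return X @ W

def div(pred, target):
    # A differentiable divergence: squared error.
    return (pred - target) ** 2

W = np.array([0.8, 0.4])

# Empirical loss: (1/N) * sum_i div(f(W; X_i), g(X_i)) on N sampled inputs.
X_sample = rng.uniform(-2, 2, size=(1000, 2))
empirical_loss = np.mean(div(f(X_sample, W), g(X_sample)))

# Monte-Carlo approximation of the true loss E[ div(f(W; X), g(X)) ].
X_large = rng.uniform(-2, 2, size=(200_000, 2))
approx_true_loss = np.mean(div(f(X_large, W), g(X_large)))

print(empirical_loss, approx_true_loss)
```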