Raj - 2024 : Lecture 3
A key assumption for this lecture: the NN has a sufficient architecture (e.g., sufficiently deep and wide) to model the desired function, unless stated otherwise.
Learning unknown functions
Neural Networks: $\hat{y} = f(x; \theta)$, where $x$ is the input (before the semicolon) and $\theta$ are the parameters.
The form of the target function is unknown - need to sample.
Based on the samples, learn a set of parameters $\theta$ to approximate the true function by minimizing the empirical error (however the error is defined).
- E.g., Binary classification: minimize the count of misclassifications.
The hope is that the learned $f(x; \theta)$ will fit the target function everywhere, despite having seen only the samples.
An example of learning a decision boundary for binary classification: Link.
- Learn $w$ such that for each input $x_i$, $w^\top x_i \ge 0$ iff $x_i$ is a positive example (this may not be possible, since the data might not be separable by a hyperplane).
A single-perceptron learning algorithm
Note: only a single neuron, with a threshold activation function, in the hidden layer. This algorithm is for binary classification only.
1. Randomly initialize $w$.
2. For each training instance $(x_i, y_i)$, where $y_i \in \{-1, +1\}$,
a. Let $\hat{y}_i = \mathrm{sign}(w^\top x_i)$.
b. If $\hat{y}_i \neq y_i$, then $w \leftarrow w + y_i x_i$.
3. Repeat step 2 until zero misclassifications.
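The steps above can be sketched in NumPy. The OR-style toy dataset, random seed, and epoch cap below are illustrative assumptions; labels are taken in $\{-1, +1\}$ and a bias feature is appended to each input:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Single-perceptron learning. X has a bias column; y in {-1, +1}."""
    w = np.random.default_rng(0).normal(size=X.shape[1])  # random init
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if w @ xi >= 0 else -1   # threshold activation
            if y_hat != yi:                    # misclassified:
                w += yi * xi                   # w <- w + y_i * x_i
                mistakes += 1
        if mistakes == 0:                      # zero misclassifications
            break
    return w

# Illustrative linearly separable data: OR, with a constant bias feature.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
w = perceptron(X, y)
preds = np.where(X @ w >= 0, 1, -1)
```

Because this data is linearly separable, the loop terminates with every point classified correctly.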
Remarks.
The algorithm converges if and only if the classes are linearly separable. Otherwise, it never converges.
This is an example of the limitations of an insufficient architecture: in this case, an NN with only one neuron in its only hidden layer.
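This limitation can be checked numerically: running the perceptron update rule on XOR, which is not linearly separable, never reaches zero misclassifications in an epoch (the zero initialization and epoch budget below are illustrative choices):

```python
import numpy as np

# XOR with a bias feature: no single hyperplane separates the classes.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])  # XOR labels in {-1, +1}

w = np.zeros(3)
for epoch in range(1000):
    mistakes = 0
    for xi, yi in zip(X, y):
        if (1 if w @ xi >= 0 else -1) != yi:  # threshold prediction
            w += yi * xi                      # perceptron update
            mistakes += 1
    if mistakes == 0:
        break  # would indicate convergence; never happens for XOR

# Even after 1000 epochs, some points remain misclassified.
```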
An alternative way to look at this problem:
a. Reflect each datapoint by its label, that is, we obtain a new dataset $x_i' = y_i x_i$.
b. Find a hyperplane $w^\top x = 0$ (where $w$ is the normal vector) such that all reflected points lie on the same side of the plane, i.e., $w^\top x_i' > 0$ for all $i$.
c. If such a hyperplane exists, then a trivial solution is setting the perceptron weights to its normal $w$.
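The reflection view is easy to verify numerically: separability reduces to all reflected points having a positive inner product with some normal vector. The OR-style data and the candidate normal $w$ below are illustrative assumptions:

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])       # OR-like labels in {-1, +1}

X_reflected = y[:, None] * X      # x_i' = y_i * x_i
w = np.array([1.0, 1.0, -0.5])    # normal of a candidate separating plane

# All reflected points lie on the positive side of the plane w^T x = 0,
# which means w classifies every original point correctly.
same_side = (X_reflected @ w > 0).all()
```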
When classes are not linearly separable
One needs multiple perceptrons, possibly one perceptron for each decision boundary. However, note that the single-perceptron algorithm (stated above) cannot be naively applied to each perceptron here, unless one tries every possible relabelling of the data for each perceptron. Learning is hard.
Issues are introduced by the combination of threshold activation functions (with either zero or infinite gradients) and the classification-error calculation (discrete and not differentiable). Therefore, algorithms based on incremental improvement are unlikely to work here, because a small change in the weights often does not change the resulting loss, unless the change pushes some input across the threshold.
Source: The class lecture note (Link)
Both the activation function and the error function should be differentiable w.r.t. the parameters $\theta$. Note that, whatever error function we use, it should be a good proxy for the classification error.
Let $\mathrm{div}(f(x; \theta), y)$ be the divergence (loss). One can then find $\hat{\theta}$ to minimize the empirical loss:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathrm{div}(f(x_i; \theta), y_i),$$
which estimates the true loss $E_X\big[\mathrm{div}(f(X; \theta), y(X))\big]$; $X$ is the random variable.
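As a sketch of this idea, a differentiable surrogate makes the empirical loss amenable to gradient descent. The logistic loss, the toy data, the step size, and the iteration count below are all illustrative choices, not part of the lecture:

```python
import numpy as np

def logistic_loss_and_grad(w, X, y):
    """Empirical loss (1/N) sum_i log(1 + exp(-y_i w^T x_i)) and gradient."""
    m = y * (X @ w)                        # per-example margins
    loss = np.mean(np.log1p(np.exp(-m)))
    s = -y / (1.0 + np.exp(m))             # d(loss_i)/d(w^T x_i)
    grad = X.T @ s / len(y)
    return loss, grad

# Illustrative separable data (OR) with a bias feature; labels in {-1, +1}.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])

w = np.zeros(3)
for _ in range(2000):                      # plain gradient descent
    loss, grad = logistic_loss_and_grad(w, X, y)
    w -= 0.5 * grad

preds = np.where(X @ w >= 0, 1, -1)
```

Unlike the discrete misclassification count, this loss changes smoothly with $w$, so every small weight update makes measurable progress.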