Problems that require hidden units: problems for which no set of
weights gives a solution without a hidden layer, that is,
functions that are not linearly separable (XOR is the classic example)
With a hidden layer and a non-linear activation function, there is
a set of weights for any arbitrary mapping (given enough hidden
units); the question is how to find it
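
A minimal sketch of the two points above, assuming threshold units and the XOR problem. The grid search, the helper names (`single_layer_solves`, `hidden_layer_xor`) and the hand-set weights are illustrative assumptions, not part of the original notes. The search finds no single-layer solution for XOR (it does find one for OR), while one hidden layer with hand-chosen weights solves it.

```python
import itertools
import numpy as np

# The four binary input patterns and the XOR targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_targets = np.array([0, 1, 1, 0])

def step(z):
    """Threshold activation: 1 where the net input is positive, else 0."""
    return (z > 0).astype(int)

def single_layer_solves(targets, grid=np.linspace(-2, 2, 21)):
    """Search a coarse grid of (w1, w2, bias) for a single threshold unit
    that reproduces `targets`. The grid is an illustrative assumption."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        if np.array_equal(step(X @ np.array([w1, w2]) + b), targets):
            return True
    return False

def hidden_layer_xor(x):
    """Hand-set weights: hidden unit 1 computes OR, hidden unit 2 computes
    AND, and the output unit fires for 'OR and not AND'."""
    W_hidden = np.array([[1.0, 1.0], [1.0, 1.0]])  # columns = hidden units
    b_hidden = np.array([-0.5, -1.5])
    h = step(x @ W_hidden + b_hidden)
    w_out = np.array([1.0, -1.0])
    return step(h @ w_out - 0.5)

print("single layer solves OR: ", single_layer_solves(np.array([0, 1, 1, 1])))
print("single layer solves XOR:", single_layer_solves(xor_targets))
print("hidden layer output:    ", hidden_layer_xor(X))
```

The weights here are set by hand; finding them automatically is the job of the learning procedure described next.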
Backpropagation: a gradient-descent procedure for learning the
weights into hidden units as well as into output units
Activation function must be differentiable; the usual choice is the
sigmoid (logistic) function: f(x) = 1 / (1 + e^(-x))
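
A hedged sketch of backpropagation for one hidden layer of sigmoid units, assuming a summed squared-error loss and batch updates. The function names (`init_params`, `forward`, `loss`, `backprop`, `train`), the network size and the learning rate are assumptions chosen for illustration. The sigmoid's derivative, y(1 - y), is what allows the output error to be passed back to the hidden units.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_in, n_hidden, n_out, seed=0):
    """Small random weights, zero biases (sizes and seed are assumptions)."""
    rng = np.random.default_rng(seed)
    return [rng.normal(0, 1, (n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(0, 1, (n_hidden, n_out)), np.zeros(n_out)]

def forward(params, X):
    W1, b1, W2, b2 = params
    H = sigmoid(X @ W1 + b1)   # hidden-unit activations
    Y = sigmoid(H @ W2 + b2)   # output-unit activations
    return H, Y

def loss(params, X, T):
    _, Y = forward(params, X)
    return 0.5 * np.sum((Y - T) ** 2)

def backprop(params, X, T):
    """Gradient of the loss with respect to [W1, b1, W2, b2]."""
    W1, b1, W2, b2 = params
    H, Y = forward(params, X)
    delta_out = (Y - T) * Y * (1 - Y)             # error signal at output units
    delta_hid = (delta_out @ W2.T) * H * (1 - H)  # error passed back to hidden units
    return [X.T @ delta_hid, delta_hid.sum(axis=0),
            H.T @ delta_out, delta_out.sum(axis=0)]

def train(params, X, T, lr=0.5, epochs=5000):
    """Plain batch gradient descent on all weights and biases."""
    for _ in range(epochs):
        grads = backprop(params, X, T)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

# XOR as the training set.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])
start = init_params(2, 3, 1)
trained = train(start, X, T)
print("loss before:", loss(start, X, T), "loss after:", loss(trained, X, T))
```

Whether a particular run reaches a good solution depends on the initial weights and the learning rate, which is the subject of the questions that follow.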
Some questions and concerns
Does BP get stuck in local minima?
Does it take forever to learn the weights?
Faster as the number of hidden units increases (assuming parallel update)
Faster with a higher learning rate, within limits
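
To illustrate "within limits", here is a deliberately simplified sketch: gradient descent on a one-dimensional quadratic error surface rather than on a network. The function `steps_to_converge` and its parameters are hypothetical names introduced for this example. A higher learning rate reaches the minimum in fewer steps, until it passes the stability limit (2 / curvature for this surface) and the weight diverges.

```python
def steps_to_converge(lr, curvature=1.0, w0=1.0, tol=1e-6, max_steps=10_000):
    """Gradient descent on E(w) = 0.5 * curvature * w**2.

    Returns the number of updates needed to bring |w| below `tol`,
    or None if the weight blows up or the step budget runs out.
    """
    w = w0
    for n in range(max_steps):
        if abs(w) < tol:
            return n
        w -= lr * curvature * w  # gradient of E with respect to w is curvature * w
        if abs(w) > 1e6:
            return None          # learning rate too high: divergence
    return None

for lr in (0.1, 0.5, 2.5):
    print(lr, steps_to_converge(lr))
```

Real error surfaces have different curvature in different directions, so the usable range of learning rates has to be found empirically.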
How does the network solve the problem? What sort of
hidden-layer representations does it build? Statistical
techniques (for example, cluster analysis or principal component
analysis) can be used to analyze hidden-layer representations.
Does it generalize? Does the network behave appropriately on
patterns which it has not been trained on?
Hidden-layer patterns can be more local or more distributed;
more distributed patterns give greater generalization
Effect of too many trainable connections: overfitting; the
network "memorizes" individual patterns rather than generalizing
over them
Optimization: setting the learning rate, other parameters
Incremental training: learning a simpler task which enables
the learning of a more complex task
Multiple tasks in a single network
Catastrophic forgetting: does the network unlearn one set of
patterns when trained on a second?
Does the network fail to learn two interfering tasks which it is
trained on simultaneously?
Modularity as a solution to problems of interference
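
The catastrophic forgetting question can be made concrete with a toy case. This sketch swaps the multi-layer network for a single linear unit trained with the delta rule, and uses two hand-picked overlapping patterns; the names (`train_delta`, `sq_error`) and the data are assumptions for illustration only. Training on task A and then on task B overwrites the shared weight, while interleaved training on both finds weights that satisfy both tasks.

```python
import numpy as np

def train_delta(w, patterns, targets, lr=0.1, epochs=500):
    """Online delta rule for one linear output unit; returns a new weight vector."""
    w = w.copy()
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            w += lr * (t - w @ x) * x
    return w

def sq_error(w, patterns, targets):
    """Summed squared error of the unit on a pattern set."""
    return float(np.sum((targets - patterns @ w) ** 2))

# Two one-pattern "tasks" whose inputs overlap on the first input line.
A_x, A_t = np.array([[1.0, 1.0]]), np.array([1.0])
B_x, B_t = np.array([[1.0, 0.0]]), np.array([-1.0])

w_A = train_delta(np.zeros(2), A_x, A_t)   # task A only
w_AB = train_delta(w_A, B_x, B_t)          # then task B only (sequential)
w_mixed = train_delta(np.zeros(2),         # both tasks interleaved
                      np.vstack([A_x, B_x]), np.concatenate([A_t, B_t]))

print("error on A after A:          ", sq_error(w_A, A_x, A_t))
print("error on A after A then B:   ", sq_error(w_AB, A_x, A_t))
print("error on A after interleaving:", sq_error(w_mixed, A_x, A_t))
print("error on B after interleaving:", sq_error(w_mixed, B_x, B_t))
```

In this toy case the two tasks are jointly solvable, so interleaving works; tasks that truly interfere would need separate resources, which is the motivation for modularity.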
Problems
Neural implausibility of the algorithm and the training procedure