-
To find the slope, we take the partial derivative of the error with
respect to the weight.
The only term in the sum of error terms that depends on the weight is
the one for the output unit where that weight ends (unit j in what follows).
$$\frac{\partial E}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}}\left[\tfrac{1}{2}\,(t_j - x_j)^2\right]$$
-
Using the chain rule, we can decompose this derivative into two that are easier
to calculate:
$$\frac{\partial\left[\tfrac{1}{2}\,(t_j - x_j)^2\right]}{\partial x_j}\;\frac{\partial x_j}{\partial w_{ji}}$$
-
The first derivative is easy to compute; it's just
$$-(t_j - x_j)$$
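Spelled out, the power rule brings down the exponent, and the inner derivative of (t_j - x_j) with respect to x_j contributes the factor -1:
$$\frac{\partial}{\partial x_j}\left[\tfrac{1}{2}\,(t_j - x_j)^2\right] = \tfrac{1}{2}\cdot 2\,(t_j - x_j)\cdot(-1) = -(t_j - x_j)$$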
-
The second derivative can be decomposed using the chain rule again if we
remember that the activation of unit j is a function of the input
to the unit, $I_j$, which is in turn a function of the weights into the unit.
$$\frac{\partial x_j}{\partial w_{ji}} = \frac{\partial x_j}{\partial I_j}\;\frac{\partial I_j}{\partial w_{ji}} \tag{4}$$
-
Since the activation of an output unit is the activation function f
applied to the input I, the first derivative on the right-hand side
of (4) is just
$$f'(I_j),$$
that is, the derivative of whatever the activation function is at the value
of the current input to unit j.
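As a concrete illustration, here is a minimal sketch that assumes f is the logistic (sigmoid) function. The text leaves the activation function open, so this choice, the names `sigmoid` and `sigmoid_prime`, and the sample input value are all assumptions made for the example. It compares the closed-form derivative with a finite-difference estimate.

```python
import math

def sigmoid(I):
    """Logistic activation f(I) = 1 / (1 + e^(-I)); an assumed choice of f."""
    return 1.0 / (1.0 + math.exp(-I))

def sigmoid_prime(I):
    """Derivative of the logistic function: f'(I) = f(I) * (1 - f(I))."""
    x = sigmoid(I)
    return x * (1.0 - x)

I_j = 0.5   # hypothetical net input to unit j
eps = 1e-6  # step for the central-difference estimate

# Numerical estimate of f'(I_j) to compare with the closed form
numeric = (sigmoid(I_j + eps) - sigmoid(I_j - eps)) / (2 * eps)
print(sigmoid_prime(I_j), numeric)
```

With a different activation function, only `sigmoid` and `sigmoid_prime` would need to change.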
-
The second derivative on the right-hand side of (4) can be derived as
follows:
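$$\frac{\partial I_j}{\partial w_{ji}} = \frac{\partial\left(\sum_k x_k w_{jk}\right)}{\partial w_{ji}} = x_i$$
because only the term with k = i contains $w_{ji}$; none of the other weights
or input activations depend on it.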
-
Putting all of the parts together, we get
$$\frac{\partial E}{\partial w_{ji}} = -(t_j - x_j)\,f'(I_j)\,x_i$$
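One way to sanity-check this result is to compare it with a finite-difference estimate of the error slope. The sketch below assumes a logistic activation function and uses made-up input activations, weights, and target; none of these values or function names come from the text.

```python
import math

def f(I):
    """Assumed activation function: the logistic function."""
    return 1.0 / (1.0 + math.exp(-I))

def f_prime(I):
    """Derivative of the logistic function."""
    x = f(I)
    return x * (1.0 - x)

def net_input(w_j, x_in):
    """I_j = sum over k of x_k * w_jk."""
    return sum(x_k * w_jk for x_k, w_jk in zip(x_in, w_j))

def error(w_j, x_in, t_j):
    """E = 1/2 (t_j - x_j)^2 for a single output unit j."""
    return 0.5 * (t_j - f(net_input(w_j, x_in))) ** 2

def analytic_grad(w_j, x_in, t_j, i):
    """dE/dw_ji = -(t_j - x_j) f'(I_j) x_i, as derived above."""
    I_j = net_input(w_j, x_in)
    return -(t_j - f(I_j)) * f_prime(I_j) * x_in[i]

def numeric_grad(w_j, x_in, t_j, i, eps=1e-6):
    """Central-difference estimate of dE/dw_ji."""
    w_plus, w_minus = list(w_j), list(w_j)
    w_plus[i] += eps
    w_minus[i] -= eps
    return (error(w_plus, x_in, t_j) - error(w_minus, x_in, t_j)) / (2 * eps)

x_in = [0.2, -0.7, 1.0]   # hypothetical input activations x_i
w_j = [0.5, 0.1, -0.3]    # hypothetical weights into unit j
t_j = 1.0                 # hypothetical target for unit j

for i in range(len(w_j)):
    print(i, analytic_grad(w_j, x_in, t_j, i), numeric_grad(w_j, x_in, t_j, i))
```

The two columns should agree to several decimal places for every weight.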
-
Remember that we want the weight change to be proportional to the negative
of the derivative with respect to the weight. So with a learning rate $\eta$
to control the step size for weight changes, we get the more general
delta (least mean squares) learning rule
$$\Delta w_{ji} = \eta\,(t_j - x_j)\,f'(I_j)\,x_i$$
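The rule can be sketched in a few lines of code. This is a minimal illustration for a single output unit trained on a single pattern, again assuming a logistic activation function. The function name `delta_rule_step`, the learning rate, and the data are hypothetical.

```python
import math

def f(I):
    """Assumed activation function: the logistic function."""
    return 1.0 / (1.0 + math.exp(-I))

def f_prime(I):
    """Derivative of the logistic function."""
    x = f(I)
    return x * (1.0 - x)

def delta_rule_step(w_j, x_in, t_j, eta):
    """Return updated weights: w_ji + eta * (t_j - x_j) * f'(I_j) * x_i."""
    I_j = sum(x_i * w_ji for x_i, w_ji in zip(x_in, w_j))
    x_j = f(I_j)
    delta = (t_j - x_j) * f_prime(I_j)  # the part shared by all weights into j
    return [w_ji + eta * delta * x_i for w_ji, x_i in zip(w_j, x_in)]

def error(w_j, x_in, t_j):
    """E = 1/2 (t_j - x_j)^2 for the single output unit."""
    I_j = sum(x_i * w_ji for x_i, w_ji in zip(x_in, w_j))
    return 0.5 * (t_j - f(I_j)) ** 2

x_in = [0.2, -0.7, 1.0]   # hypothetical input activations
w_j = [0.5, 0.1, -0.3]    # hypothetical initial weights into unit j
t_j = 1.0                 # hypothetical target
eta = 0.5                 # hypothetical learning rate

before = error(w_j, x_in, t_j)
for _ in range(100):
    w_j = delta_rule_step(w_j, x_in, t_j, eta)
after = error(w_j, x_in, t_j)
print(before, after)
```

Repeated steps on the same pattern should shrink the error, since each update moves the weights a small distance down the error slope.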