One output unit for each possible action (local representations)
Q values
Weights connecting state input units to action output units
With distributed state representations, a separate value for
every pairing of a state feature with an action
With a hidden layer, Q values are distributed across input-to-hidden
and hidden-to-output weights
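As a rough illustration of the no-hidden-layer case described above, the Q values can be pictured as a table of weights with one row per action output unit and one column per state input unit. This is only a sketch: the sizes and the names `make_weights`, `N_FEATURES`, and `N_ACTIONS` are hypothetical and do not come from the notes.

```python
# Hypothetical sizes, chosen only for illustration.
N_FEATURES = 3   # number of state input units
N_ACTIONS = 2    # number of action output units


def make_weights(n_actions, n_features, initial=0.0):
    """Return one independent row of weights per action output unit."""
    return [[initial] * n_features for _ in range(n_actions)]


weights = make_weights(N_ACTIONS, N_FEATURES)

# weights[j][i] is the value stored for the pairing of
# action j (output unit) with state feature i (input unit).
```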
Operations
Finding a Q value for a given state and action (getQ())
Clamp the pattern for the state on the input layer.
Run the network.
Read off the activation of the output unit for the action.
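The three steps above might look like the following for a network without a hidden layer. This is a minimal sketch assuming linear output units (the notes do not specify an activation function); `run_network` and `get_q` are illustrative names, with `get_q` standing in for getQ(), and the weights are made up.

```python
def run_network(weights, state):
    """Clamp the state on the input layer and compute every output activation.

    Assumes linear output units: each activation is a weighted sum of inputs.
    """
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def get_q(weights, state, action):
    """Read off the activation of the output unit for the given action."""
    return run_network(weights, state)[action]


# Made-up weights: 2 actions (rows) by 3 state features (columns).
weights = [[0.5, -0.2, 0.0],
           [0.1, 0.3, 0.6]]
state = [1.0, 0.0, 1.0]   # distributed pattern for one state

q = get_q(weights, state, 1)   # 0.1 * 1.0 + 0.3 * 0.0 + 0.6 * 1.0
```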
Finding the highest Q value and best action for a given state
(getHighestQ(), getBestAction())
Clamp the pattern for the state on the input layer.
Run the network.
Find the output unit with the highest activation.
That's the highest Q value, and the index of that unit is the
best action.
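A sketch of the same procedure in code, again assuming a network with linear output units and no hidden layer. `get_highest_q` and `get_best_action` stand in for getHighestQ() and getBestAction(); the weights and `run_network` helper are hypothetical.

```python
def run_network(weights, state):
    """Compute one linear output activation per action (assumed unit type)."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def get_highest_q(weights, state):
    """The highest output activation is the highest Q value for the state."""
    return max(run_network(weights, state))


def get_best_action(weights, state):
    """The index of the most active output unit is the best action."""
    outputs = run_network(weights, state)
    return max(range(len(outputs)), key=outputs.__getitem__)


# Made-up weights: 2 actions (rows) by 3 state features (columns).
weights = [[0.5, -0.2, 0.0],
           [0.1, 0.3, 0.6]]
state = [1.0, 0.0, 1.0]

best_action = get_best_action(weights, state)
highest_q = get_highest_q(weights, state)
```

Ties are broken here in favor of the lowest-numbered action, which is a choice of this sketch rather than something the notes prescribe.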
Learning the Q value for a state x and action u
Try the action and receive the reinforcement (updating the variable
in the memory).
On the next time step, find the highest Q value for the state that
follows x and u by running the network with
state x_{t+1} and finding the highest output.
Run the network with state x.
Find the output unit j corresponding to action u.
Only the weights into this unit will be updated.
Calculate the target for output unit j.
This is just the "new" part of the Q-learning rule: the reinforcement
plus the discounted best value for the next state.
Calculate the error on output unit j.
Update the weights into unit j using the delta rule.
Note that the "old" and "new" parts of the lookup-table version
of the Q-learning rule are built into the weight update rule,
which combines old and new values for each weight.
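The learning steps above can be sketched as a single update function. The assumptions are loud ones: linear output units, no hidden layer, and hypothetical names and default values for the discount factor `gamma` and learning rate `lr`, none of which are given in the notes.

```python
def run_network(weights, state):
    """Compute one linear output activation per action (assumed unit type)."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def q_update(weights, state, action, reward, next_state, gamma=0.9, lr=0.1):
    """Apply one delta-rule update to the weights into the chosen action's unit."""
    # Highest Q value for the next state.
    best_next = max(run_network(weights, next_state))
    # Target: reinforcement plus the discounted best value for the next state.
    target = reward + gamma * best_next
    # Error on output unit j, the unit corresponding to the action taken.
    error = target - run_network(weights, state)[action]
    # Delta rule: only the weights into unit j change.
    for i, s in enumerate(state):
        weights[action][i] += lr * error * s
    return error


weights = [[0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0]]
error = q_update(weights, state=[1.0, 0.0, 1.0], action=0,
                 reward=1.0, next_state=[0.0, 1.0, 0.0])
```

Because each weight moves only a fraction `lr` of the way toward reducing the error, the old estimate and the new target are blended without an explicit "old" term in the rule.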
Cognitive Science at Indiana University | Fall 2004