Q learning tries to find a value for each pair of a state (x)
and an action (u).
Each Q-value is a function of a state and an action: Q(x, u).
Obviously the Q-value for a given x, u pair
needs to be based on the reinforcement that the
environment gives the agent when the agent performs u in x
(that is, what the environment's reinforcement function returns).
But sometimes there's no reinforcement; instead an action may make a
future reinforcement possible.
So the Q-value also needs to somehow take into account what happens in the next state
(that is, what the environment's next-state function returns).
The idea is to take the Q-value to be the sum of all future
reinforcements received by the agent when taking that action and then
following its policy.
Of course the agent can't know this, so it has to estimate it.
To do this, it treats the Q-value of a state-action pair
recursively (that is, the value is defined in terms
of itself). The value is (a function of)
the immediate reinforcement received and
(what the agent currently believes to be)
the value of the best action in the new state that is reached.
So at any given point during learning, the agent has an estimate of
the values for all state-action pairs.
On a given learning trial, the agent is in state xt.
It does the following.
Selects an action ut and executes it.
Notes whatever reinforcement it receives as a result:
r(xt, ut)
Notes what new state it is in: xt+1.
Finds what it believes to be the
value of the best action in the new state:
Qbest(xt+1, u)
Combines 2 and 4 to update its estimated value for the pair
xt, ut.
How should we combine the immediate reinforcement with the estimated
best value of the next state?
Just adding them may create problems if the future reinforcements are
infinite (for example, if there's no obvious end to the series of actions),
so usually it is the discounted best value of the next state
that is added, that is, that estimate times a discount rate γ,
ranging between 0 and 1.
When γ is 0,
only immediate reinforcement matters; when it is 1, all reinforcements are
equally weighted.
So we could update the Q-value for a state-action pair using this
update equation:
The expression beginning with "max" means for all of the possible actions
in the next state (ut+1), the highest value of those
that agent has stored for that state in its LTM.
But the information we get on a given learning trial may be misleading,
so this update equation moves too quickly.
Instead we usually move relatively slowly in the direction indicated by the
current evidence; that is, there is a learning
rate (η), which controls the step
size of the learning.
With this, the update equation is:
This equation weights combines the old knowledge that the agent has with
the new information coming from the current experience of receiving a
reinforcement and ending up in a new state.
Cognitive Science at Indiana University | Fall 2004