In this assignment you will implement reinforcement learning in the critters, this time
using neural nets, one for each critter.
Again you will be working with the Brain and Memory classes.
You can submit the two files separately.
Go to the Vincent page for the class to upload the files.
Each critter has the ability to touch all 6 cells around it and to know the texture of each
of these cells (though the number of touchable cells and the number of distinguishable textures can be changed in the world files).
On each time step, the brain's sense() method returns an array of doubles,
either 0.0 or 1.0 (they are doubles because that's what the neural network expects).
The representation of the sensory input assigns 4 positions in the array
to each touchable cell, one position for each of the 4 distinguishable textures, 24 positions in all.
(Note that if states were represented locally, we would need an array of $4^6 = 4096$ numbers for all of the possible states.)
Each position in this array of 24 numbers corresponds to a single input unit in the neural network for the critter.
The numbers in the array returned by sense() represent the cells around the critter starting from the cell in front of it and moving clockwise around it.
So the first 4 numbers represent the cell in front of the critter, the second 4 numbers represent the cell that is 60 degrees clockwise from the critter's heading, etc.
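The encoding just described can be sketched as follows. This is only an illustration, not the course code: the class name, the `encode` method, and the idea of passing textures in as integer codes 0 to 3 are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical sketch of the sensory encoding; not part of the assignment code.
final class SenseEncodingSketch {
    static final int CELLS = 6;     // touchable cells, starting in front and moving clockwise
    static final int TEXTURES = 4;  // distinguishable textures

    // textures[c] is an assumed integer code (0..3) for the texture of cell c.
    static double[] encode(int[] textures) {
        double[] state = new double[CELLS * TEXTURES];  // 24 positions, all 0.0 initially
        for (int c = 0; c < CELLS; c++) {
            // Each cell owns 4 consecutive positions; exactly one of them is set to 1.0.
            state[c * TEXTURES + textures[c]] = 1.0;
        }
        return state;
    }

    public static void main(String[] args) {
        // Cell in front has texture 2, the cell 60 degrees clockwise has texture 0, and so on.
        System.out.println(Arrays.toString(encode(new int[] {2, 0, 0, 3, 1, 0})));
    }
}
```

With this layout, positions 0 to 3 describe the cell in front, positions 4 to 7 the next cell clockwise, and so on.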
After calling sense(), the critter chooses an action on the basis of the state
that it gets.
It uses a more sophisticated algorithm for doing this than in the model in the
previous assignment.
The following equation gives the probability of picking an action
$u_i$ in a state $x$.
(Note: you do not have to understand this to do the rest of the assignment, but it will only work
if you've implemented the basic operations of the neural network.)
$$P(u_i) = \frac{e^{E \cdot a \cdot Q(x, u_i)}}{\sum_j e^{E \cdot a \cdot Q(x, u_j)}}$$
As before, the probability of picking an action depends on the constant E
(exploitationRate in the Memory class, not Brain, as in the last assignment)
and on a, the number of
learning experiences (the critter's age in the program).
But it also depends on the relative Q value associated with that state-action
pair, rather than simply on which Q value is highest for this state.
As long as two or more Q values remain close in value, this approach allows
some exploration to continue to happen almost indefinitely.
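A minimal sketch of this selection rule is given below. The class and method names are invented for the illustration and are not the ones in Brain or Memory. Subtracting the largest Q value before exponentiating is an addition of mine: it leaves the probabilities unchanged but keeps `Math.exp` from overflowing when E times a grows large.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of the action-selection equation; names are assumptions.
final class DecideSketch {
    // Returns P(u_i) for every action, given the Q values for the current state.
    static double[] actionProbabilities(double[] q, double exploitationRate, int age) {
        double scale = exploitationRate * age;  // E * a in the equation
        double max = Double.NEGATIVE_INFINITY;
        for (double v : q) max = Math.max(max, v);
        double[] p = new double[q.length];
        double sum = 0.0;
        for (int i = 0; i < q.length; i++) {
            // Shifting by max does not change the ratios but avoids overflow.
            p[i] = Math.exp(scale * (q[i] - max));
            sum += p[i];
        }
        for (int i = 0; i < q.length; i++) p[i] /= sum;
        return p;
    }

    // Samples an action index according to the given probabilities.
    static int pickAction(double[] p, Random rng) {
        double r = rng.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < p.length; i++) {
            cumulative += p[i];
            if (r < cumulative) return i;
        }
        return p.length - 1;  // guard against rounding error in the running sum
    }

    public static void main(String[] args) {
        double[] q = {0.2, 0.25, -0.1};
        System.out.println(Arrays.toString(actionProbabilities(q, 0.1, 1)));
        System.out.println(Arrays.toString(actionProbabilities(q, 0.1, 500)));
    }
}
```

Running `main` with a small and a large age shows the effect described above: the same Q values give nearly uniform probabilities early on and a much sharper preference later, while two Q values that stay close keep comparable probabilities for longer.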
The Q values that are needed in the decision equation are stored in a neural
network rather than a lookup table.
The neural network has two layers of units that I'll call the External Input layer
and the Output layer.
The External Input layer represents a state,
a set of values that are returned when sense() is called in the
critter's brain.
That is, each state is represented in a distributed rather than a
local fashion.
The Output Layer of units in the network represents the possible actions available
to the critter, one unit for each action.
That is, the actions are represented in a local fashion.
To "look up" a Q value for a given state-action pair, we first "clamp" the external input units to a pattern representing the state (that is, set the activation of each external input unit to be the corresponding value in the state array). Then, we run the network. That is, we calculate the input into each output unit (using the dot-product rule) and then calculate the activation of each output unit. For this network, the activation function is just the identity function, so this last step just returns the input to the output unit. Then we read off the activation of the output unit that represents the action we want. This is the network's Q value for the given state and action.
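The lookup described above can be sketched in a few lines. This is a standalone illustration, not the Memory class: the method names and the `weights[output][input]` layout are assumptions.

```java
// Hypothetical sketch of looking up a Q value; names and weight layout are assumptions.
final class LookupSketch {
    // weights[j][i] is the weight on the connection from input unit i to output unit j.
    static double[] runNetwork(double[] state, double[][] weights) {
        double[] outputs = new double[weights.length];
        for (int j = 0; j < weights.length; j++) {
            double input = 0.0;
            for (int i = 0; i < state.length; i++) {
                input += weights[j][i] * state[i];  // dot product of inputs and weights
            }
            outputs[j] = input;  // identity activation: activation equals input
        }
        return outputs;
    }

    // The Q value for (state, action) is the activation of that action's output unit.
    static double lookUpQ(double[] state, int action, double[][] weights) {
        return runNetwork(state, weights)[action];
    }

    public static void main(String[] args) {
        double[][] weights = {{0.5, 2.0, -0.25}, {1.0, 1.0, 1.0}};
        double[] state = {1.0, 0.0, 1.0};
        System.out.println(lookUpQ(state, 0, weights));
        System.out.println(lookUpQ(state, 1, weights));
    }
}
```

Because the state array contains only 0.0 and 1.0, each Q value is simply the sum of the weights coming from the input units that are on.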
To perform Q learning, that is, to update the Q value associated with a given state and action pair, the model uses the delta rule, an algorithm for updating the weights in a neural network on the basis of the network's error, that is, its deviation from a target. For a given input unit i and output unit j, the delta rule changes the weight on the connection joining the two as follows.
$$\Delta w_{ij} = \eta \, (t_j - x_j) \, x_i$$
Here $t_j$ represents the target value for output unit j, $x$ represents the activation of a unit, and η is a learning rate between 0 and 1.
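As a worked instance, assume the standard form of the delta rule, $\Delta w_{ij} = \eta \, (t_j - x_j) \, x_i$, and take the made-up values $\eta = 0.5$, $t_j = 1.0$, and $x_j = 0.2$. For an input unit that is on ($x_i = 1.0$):

$$\Delta w_{ij} = 0.5 \times (1.0 - 0.2) \times 1.0 = 0.4$$

For an input unit that is off ($x_i = 0.0$), the change is $0.5 \times 0.8 \times 0.0 = 0$, so only the weights coming from active input units are adjusted.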
For a given state and action, we know the values of $x_i$ and $x_j$ for all input and output units after we run the network. That leaves the value of $t_j$ for each output unit, the target that is required for delta-rule learning (and supervised learning generally). In Q learning, we only update the Q value for an action that is attempted in the given state; this means that we should only change the weights into the output unit for that action. So we set the error for all of the other output units to 0.0. The target for the output unit for the attempted action is then just the "new" part given by the Q learning rule: the reinforcement $r$ plus the discounted best Q value for the next state $x'$.
$$t_j = r + \gamma \max_{u} Q(x', u)$$
The reinforcement part of this equation is provided by the world when the action is attempted, exactly as in the last assignment; what about the second term?
The discount factor, γ, is a constant as before (discountRate in Memory); that leaves the best Q value for the next state.
Once we know the next state — and we do on the next time step when sense() is
called again — we can simply run the neural network using this state as input
and find the highest activation (that is, the highest Q value) among the output
units.
This is the best Q value for the next state, that is, that part of the expression
following γ.
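The target computation can be sketched as a single small function. The name `target` and its parameters are hypothetical; in the assignment the pieces come from the world (the reinforcement), from Memory (discountRate), and from running the network on the next state.

```java
// Hypothetical sketch of the Q learning target; names are assumptions.
final class TargetSketch {
    // nextStateQs holds the output activations after running the network on the next state.
    static double target(double reinforcement, double discountRate, double[] nextStateQs) {
        double best = nextStateQs[0];
        for (double q : nextStateQs) best = Math.max(best, q);  // best Q value for the next state
        return reinforcement + discountRate * best;
    }

    public static void main(String[] args) {
        System.out.println(target(1.0, 0.5, new double[] {0.0, 2.0, -1.0}));
    }
}
```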
In sum, then, learning works as follows.
We've saved the last state, the last action, and the reinforcement that resulted from
these in the brain's memory, and now sense() tells us the next state.
We run the network with this next state as its input to find the best Q value for the next state, and then combine that value with the reinforcement, using the equation above, to get the target for
the output unit corresponding to the last action.
We then run the network again, this time with the last state as input (using the
Memory instance variable lastState, except that now this is an array of doubles).
This gives us an output pattern so that we can apply the delta rule
(above) using the target we just calculated to update the weights into the output unit for the last action.
Look at the step method in the Brain class.
It starts with a call to sense(),
which, unlike in the last assignment,
returns an array of doubles.
This array of doubles is what is used, both to decide() on what action
to take and, in the critter's memory, to learn().
Next, decide() selects an action, using the equation in the last section,
returning this action (an integer).
Next, learn() updates the weights in the neural network (stored in
the memory) for the last state and action.
Finally, the critter attempts to execute the new action, returning
the reinforcement and passing this to its memory (exactly as before).
In this assignment, the set of possible actions available to the Critter
is always the same, MOVE, TURN in any of the five directions, and REST.
Note that there is no EAT action; the critters eat by MOVing onto plants.
RESTing involves no turn or movement and costs nothing (except the cost that each step
accrues).
Start by downloading the version of the program that you will use for this assignment. Also download the completed program if you want to see how it should work when it is done.
Most of what you will do is in Memory, but some of it is already done for you.
Before you start, look through the documentation for Memory to see what is there.
All of the instance variables needed for the neural network are already defined,
and the method that initializes them, initQs(), is defined and called in the
constructor.
Note that getQ() works differently than before since it takes only
an action integer and no state argument.
This is because this method is used after the network has been run with a particular
state as input.
If you click on a critter in the world window, its neural network will pop up.
To make the network show up in this window, you will have to call
the method showNN somewhere.
Note that this method has two versions; use the one with target arguments if you want
a target to be displayed.
After you get the neural network working, when you click on a cell in a neural network
window, its value (activation for units, weights for connections) will appear in the message window.
There's also a method, showQ, which prints all of the weights in the neural network
in the message window; you can use this to help you debug.
Define a method, activate(), that takes an input pattern
(an array of doubles), clamps it on the network's external input units, and then
activates the network's output units.
(Note that this method is already there, because it is required for
getBestState, but it doesn't do anything.)
The activate() method uses a method that is already defined, clamp(), and
then uses the general neural network rule for sending input to a unit, the dot product of the
vector of input units and the weights from them to the destination unit.
For this you can use the built-in static method in Utils, dotProduct.
Because the activation function of the output units in the networks is just the identity function, you don't have to worry about it.
You can debug your activate method by initializing the weights in the network to values other than 0.0 (in initQs) and watching what happens to the output units (in the network
window) when the method is called with a particular input pattern, for example, from
sense(). (Of course you will have to call it in step() in order to see this, and you will have to call showNN() in activate() or the network won't show up in the window.)
Define a method, figureError(), that takes an integer representing an output
unit index and a double representing the target for this unit.
It then sets the values in the error array (an instance variable) accordingly.
For all units other than the indicated one, the error is 0.
For the indicated one, it is the difference between the given target and the network's activation
for that output unit.
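A minimal sketch of this error calculation follows. To keep the example self-contained it takes the output activations as a parameter and returns a new array; in Memory the method would instead read and write instance variables, so the signature here is an assumption.

```java
import java.util.Arrays;

// Hypothetical standalone version of the error calculation; the signature is an assumption.
final class FigureErrorSketch {
    static double[] figureError(double[] outputs, int unit, double target) {
        double[] error = new double[outputs.length];  // every entry starts at 0.0
        error[unit] = target - outputs[unit];         // only the indicated unit gets an error
        return error;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(figureError(new double[] {0.5, 1.0, -2.0}, 1, 3.0)));
    }
}
```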
Define a method, updateWeights(), that changes all the weights in the network using the
delta rule.
This method doesn't have to take any arguments because everything is available in instance variables:
error, weights, and externalInputs.
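The weight update can be sketched as a double loop over output and input units. This standalone version passes everything in as parameters, assumes the standard form of the delta rule (learning rate times error times input activation), and assumes a `weights[output][input]` layout; the real method would use the instance variables instead.

```java
import java.util.Arrays;

// Hypothetical standalone delta-rule update; parameter list and weight layout are assumptions.
final class UpdateWeightsSketch {
    static void updateWeights(double[][] weights, double[] error,
                              double[] externalInputs, double learningRate) {
        for (int j = 0; j < weights.length; j++) {
            for (int i = 0; i < externalInputs.length; i++) {
                // Weights into units with zero error, or from inactive inputs, do not change.
                weights[j][i] += learningRate * error[j] * externalInputs[i];
            }
        }
    }

    public static void main(String[] args) {
        double[][] weights = new double[2][3];
        updateWeights(weights, new double[] {0.0, 2.0}, new double[] {1.0, 0.0, 1.0}, 0.5);
        System.out.println(Arrays.deepToString(weights));
    }
}
```

Since the error is 0.0 for every output unit except the one for the attempted action, looping over all units still changes only the weights into that one unit.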
Try calling activate(), figureError(), and updateWeights() in step() and see what happens to the weights.
(Try setting the learning rate in the world file to 1.0 so that learning happens very quickly.)
Of course these Memory methods will have to be public if you are going to call them in step()
in Brain.
Define getHighestQ(), which, as before, you will need for your learn()
method.
It takes a state array and returns an integer representing the action that the memory thinks is
best.
It does this by calling activate() (the method that you defined in 1) with the state
and then finding the output unit that has the highest activation.
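The search for the most active output unit can be sketched as below. The helper name `indexOfHighest` is hypothetical, and it works on an array of activations that is assumed to have been filled in already by running the network. Ties go to the lowest index here, which is an arbitrary choice.

```java
// Hypothetical helper for finding the best action; the name is an assumption.
final class HighestQSketch {
    // Returns the index of the output unit with the highest activation.
    static int indexOfHighest(double[] outputs) {
        int best = 0;
        for (int j = 1; j < outputs.length; j++) {
            if (outputs[j] > outputs[best]) best = j;  // strict '>' keeps the first of any ties
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(indexOfHighest(new double[] {0.1, 0.7, 0.3}));
    }
}
```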
Complete the method learn(),
which is already started for you.
As before, it takes a new state (now an array of doubles instead of an integer) and a new action,
and learns about the lastState and lastAction, using the lastReinforcement.
It first figures the target Q value for the action output unit.
To do this, it just uses the Q learning equation (only the "new" part, that is, the part that
combines reinforcement with estimated next state value).
The reinforcement is the last reinforcement.
To get the other term, use the method that you just defined, getHighestQ.
Then the method runs the network by calling activate() with the
lastState as input.
It then calls figureError(), using the target found at the beginning,
to set the error, and then calls updateWeights() to change
the weights in the network.
Finally, just as in the last assignment, it updates the lastState
and lastAction variables.
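Putting the pieces together, the whole learning step might look like the self-contained sketch below. It is not the provided Memory class: the constructor, the `weights[action][input]` layout, and the check that skips learning on the very first step (when there is no last state yet) are all assumptions made so that the example runs on its own. The method names follow the ones used in this assignment.

```java
import java.util.Arrays;

// Hypothetical stand-in for Memory, showing the order of operations in learn().
final class LearnSketch {
    final double[][] weights;   // weights[action][input], an assumed layout
    final double[] outputs;     // activations of the output units
    final double[] error;       // one error value per output unit
    double[] externalInputs;    // the clamped input pattern
    final double learningRate, discountRate;
    double[] lastState;         // null until the first step has been taken
    int lastAction;
    double lastReinforcement;

    LearnSketch(int nInputs, int nActions, double learningRate, double discountRate) {
        weights = new double[nActions][nInputs];
        outputs = new double[nActions];
        error = new double[nActions];
        externalInputs = new double[nInputs];
        this.learningRate = learningRate;
        this.discountRate = discountRate;
    }

    void activate(double[] state) {
        externalInputs = state.clone();  // clamp the input units
        for (int j = 0; j < outputs.length; j++) {
            double sum = 0.0;
            for (int i = 0; i < externalInputs.length; i++) sum += weights[j][i] * externalInputs[i];
            outputs[j] = sum;            // identity activation
        }
    }

    double getQ(int action) { return outputs[action]; }  // valid after activate()

    int getHighestQ(double[] state) {
        activate(state);
        int best = 0;
        for (int j = 1; j < outputs.length; j++) if (outputs[j] > outputs[best]) best = j;
        return best;
    }

    void figureError(int unit, double target) {
        Arrays.fill(error, 0.0);
        error[unit] = target - outputs[unit];
    }

    void updateWeights() {
        for (int j = 0; j < weights.length; j++)
            for (int i = 0; i < externalInputs.length; i++)
                weights[j][i] += learningRate * error[j] * externalInputs[i];
    }

    void learn(double[] newState, int newAction) {
        if (lastState != null) {                    // assumed guard for the first step
            int best = getHighestQ(newState);       // runs the network on the new state
            double target = lastReinforcement + discountRate * getQ(best);
            activate(lastState);                    // run again with the last state
            figureError(lastAction, target);
            updateWeights();
        }
        lastState = newState.clone();
        lastAction = newAction;
    }
}
```

Note the order inside learn(): the target has to be computed before the network is re-run on the last state, because activate() overwrites the output activations that getQ() reads.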
If you run the program with the safe world file and your program is working right,
about half of your critters should survive more or less indefinitely.
You can experiment with different values of the learning rate, discount rate,
and exploitation rate, or, if you want, with different reinforcements (moveCost, eatGood, etc.).
Report informally on the performance of your critters in comments at the top of your
Memory file.