In this assignment you will implement reinforcement learning in the critters.
You will work with both the Brain and Memory classes.
You can submit the two files separately.
Go to the Vincent page for the class to upload the files.
Start by downloading the version of the program that you will use for this assignment.
As in the last assignment, the critters always either move, turn, or eat, and they have access only
to the touch sensor, which reports the texture of the cell in front of them.
Since we won't deal with worlds containing predators, there are four possible textures (see the
constants in Sensor).
One difference is that the critters start out with much higher strengths than before (3000) so
they won't die before they learn how to survive in the world.
(They will probably die eventually anyway, but at least you'll get a chance to see them get somewhat
smarter first.)
Each Brain has a Memory (this part is already done), which maintains
a two-dimensional array of doubles for the Q values that the critter is learning.
The method that is in control of all of this is step() in Brain.
This method has comments in it for what it will eventually do when you are finished with
the assignment.
As you can see, the following basic processes are supposed to happen in step():
- decide, which you will write.
- learn, which updates the Q value for the last state-action pair, using the last reinforcement, and then updates its short-term memory for state and action to be the new pair.

As you do each of the following, make sure that each works before going on to the next. Remember to recompile after every small change you make.
1. In Memory, create the three "short-term memory" instance variables you will
need to keep track of the last state, last action, and last reinforcement, and
write the basic procedures that you will need to initialize, access, update,
and print out the array of Q values: initQ, getQ, setQ, printQ.
For printQ, notice that there is a variable for the critter's effector in
Memory so you can call say from this class.
Also you may want to use String arrays representing the names of
actions and textures, Brain.ACTIONS and Sensor.TEXTURES.
Here is what my printQ method does:

    Memory for critter 1
    State EMPTY
        Action Move: -0.5516985999092149
        Action Step: -0.47641345873882085
        Action Eat: -0.5771634603578751
    State HARD
        Action Move: -0.7165374999999999
        Action Step: -0.14946906909375
        Action Eat: -0.9372029062499998
    State SOFT
        Action Move: -0.25
        Action Step: -0.10097375
        Action Eat: 0.0
    State FUZZY
        Action Move: 0.0
        Action Step: 0.0
        Action Eat: 0.0

Be sure to call initQ in the constructor for Memory.
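The table operations described above can be sketched roughly as follows. This is only an illustration, not the course's Memory class: the class name QTableSketch, the formatQs helper, and the two name arrays (stand-ins for Sensor.TEXTURES and Brain.ACTIONS, in an assumed order) are hypothetical, and the real printQ would pass its text to the effector's say rather than return a String.

```java
// Sketch only: sizes, name arrays, and the String-returning report are assumptions.
class QTableSketch {
    // Hypothetical stand-ins for Sensor.TEXTURES and Brain.ACTIONS.
    static final String[] TEXTURES = {"EMPTY", "HARD", "SOFT", "FUZZY"};
    static final String[] ACTIONS = {"Move", "Step", "Eat"};

    // One row per state (texture), one column per action.
    private final double[][] q = new double[TEXTURES.length][ACTIONS.length];

    QTableSketch() {
        initQ(); // the handout asks for initQ to be called in the constructor
    }

    // Reset every Q value to 0.0.
    void initQ() {
        for (double[] row : q) {
            java.util.Arrays.fill(row, 0.0);
        }
    }

    double getQ(int state, int action) {
        return q[state][action];
    }

    void setQ(int state, int action, double value) {
        q[state][action] = value;
    }

    // Builds the report text; a real printQ would hand this to say.
    String formatQs(int critterId) {
        StringBuilder sb = new StringBuilder("Memory for critter " + critterId + "\n");
        for (int s = 0; s < TEXTURES.length; s++) {
            sb.append("State ").append(TEXTURES[s]).append("\n");
            for (int a = 0; a < ACTIONS.length; a++) {
                sb.append("    Action ").append(ACTIONS[a]).append(": ")
                  .append(q[s][a]).append("\n");
            }
        }
        return sb.toString();
    }
}
```

Returning the report as a String keeps the sketch independent of the effector, which is not shown in this handout.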
2. Write two methods in Memory that find the best Q value for a given state:
getHighestQ, which just returns this value (you will need this for your
learn method) and getBestAction, which returns the action (an integer)
associated with the best Q value for the given state (you will need this for your
decide method).
To test these methods, you could initialize your Q value table to values other than 0.0 and
then get the best Q value and best action for a given state.
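As a rough illustration of these two lookups, here is a minimal sketch that works on a plain double[][] passed in as a parameter. That signature is an assumption made so the example stands alone; in the assignment the methods would read the Q array stored in Memory. Ties are resolved in favor of the lowest action index, which is one possible choice among several.

```java
// Sketch only: static helpers over an explicit table, not the Memory methods themselves.
class BestQSketch {
    // Largest Q value in the row for the given state.
    static double getHighestQ(double[][] q, int state) {
        double best = q[state][0];
        for (int a = 1; a < q[state].length; a++) {
            if (q[state][a] > best) {
                best = q[state][a];
            }
        }
        return best;
    }

    // Index of the action with the largest Q value (first one wins on ties).
    static int getBestAction(double[][] q, int state) {
        int bestAction = 0;
        for (int a = 1; a < q[state].length; a++) {
            if (q[state][a] > q[state][bestAction]) {
                bestAction = a;
            }
        }
        return bestAction;
    }
}
```

For a row holding the values -0.7165374999999999, -0.14946906909375, and -0.9372029062499998, getBestAction returns 1 and getHighestQ returns -0.14946906909375.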
3. Write decide in Brain.
It selects the best possible action with a probability that depends on the Brain class variable
exploitationRate in the program and on a, the age of the critter, which you can get using the Sensor
method getAge.
To compute e^x, use Math.exp(x).
Otherwise decide should pick a random action.
Put a call to decide in step at the appropriate place.
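The exploit-or-explore choice in decide might look something like the sketch below. It deliberately takes the exploitation probability as a parameter (pExploit, a hypothetical name) instead of computing it, so it does not stand in for the assignment's formula involving exploitationRate and the critter's age. The explicit table and random-number-generator parameters are also assumptions made to keep the example self-contained.

```java
// Sketch only: pExploit must be computed elsewhere from exploitationRate and age.
class DecideSketch {
    static int decide(double[][] q, int state, double pExploit, java.util.Random rng) {
        // nextDouble() is uniform in [0, 1), so this branch is taken with probability pExploit.
        if (rng.nextDouble() < pExploit) {
            int best = 0;
            for (int a = 1; a < q[state].length; a++) {
                if (q[state][a] > q[state][best]) {
                    best = a;
                }
            }
            return best; // exploit: the action with the highest Q value
        }
        // explore: any action, chosen uniformly at random
        return rng.nextInt(q[state].length);
    }
}
```

Passing the random generator in makes the behavior repeatable with a fixed seed, which helps when checking the method by hand.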
4. Write learn in Memory.
It takes as parameters a new state index and a new action index.
(These values get passed to it when it is called in step
in Brain.)
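It uses the Q-learning equation to update the
Q value for the last state and action, which are stored in
Memory instance variables:

    Q(s, a) ← Q(s, a) + η [ r + γ max_a' Q(s', a') − Q(s, a) ]

Here s and a are the last state and action, r is the last reinforcement, and s' is the new state (the
state passed to the method as a parameter).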
The learning rate, η, is the Memory class variable
learningRate, and the discount rate,
γ, is the Memory class variable discountRate.
The part of the equation beginning with max uses the method that you wrote
for part 2 above, getHighestQ.
After updating the Q value for the last state and action, learn
sets the last state and last action variables to be
the new state and action (which were passed to the method as parameters).
If you fail to get this last part
right, you may end up always changing the same Q value
in the table.
Note that learn has three side effects: it changes the values
of three instance variables in Memory, namely the variables representing
the Q values, the last state, and the last action.
Add a call to learn at the appropriate place in step in Brain.
For debugging, you will also want learn to show what it is doing, by calling
printQ, though you probably want to comment this out later when you run the program
for hundreds of steps.
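To make the update and its side effects concrete, here is a minimal sketch assuming the standard Q-learning rule Q(s, a) ← Q(s, a) + η [ r + γ max_a' Q(s', a') − Q(s, a) ]. The class name, the table size, and the values 0.5 and 0.9 for learningRate and discountRate are placeholders chosen for the example, not values taken from the program.

```java
// Sketch only: field names follow the handout, but sizes and rates are placeholders.
class LearnSketch {
    double[][] q = new double[4][3]; // 4 textures x 3 actions, all 0.0 initially
    double learningRate = 0.5;       // eta (placeholder value)
    double discountRate = 0.9;       // gamma (placeholder value)

    // Short-term memory.
    int lastState;
    int lastAction;
    double lastReinforcement;

    double getHighestQ(int state) {
        double best = q[state][0];
        for (int a = 1; a < q[state].length; a++) {
            best = Math.max(best, q[state][a]);
        }
        return best;
    }

    void learn(int newState, int newAction) {
        double oldQ = q[lastState][lastAction];
        // Target: last reinforcement plus the discounted best value of the new state.
        double target = lastReinforcement + discountRate * getHighestQ(newState);
        // Side effect 1: update the Q value of the LAST state-action pair.
        q[lastState][lastAction] = oldQ + learningRate * (target - oldQ);
        // Side effects 2 and 3: the new pair becomes the last pair.
        lastState = newState;
        lastAction = newAction;
    }
}
```

Starting from an all-zero table with the last pair (0, 1) and a last reinforcement of -1.0, calling learn(2, 0) sets the Q value for (0, 1) to -0.5 and leaves (2, 0) as the remembered pair.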
5. Finish step in Brain by actually attempting the selected action
and then passing the reinforcement that is returned to the memory so that it can update its
last reinforcement variable.
You should be able to figure out how to do this.
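One way to picture the overall flow of step is the sketch below. Everything in it is hypothetical scaffolding: the touch sensor and the action attempt are modeled as plain Java functional interfaces, decide is a stub that always returns action 0, and learn only records the new pair. The real Brain, Memory, Sensor, and effector classes will look different; only the ordering (sense, decide, learn, act, remember the reinforcement) reflects the description above.

```java
// Sketch only: stand-in types and stubs, showing the order of operations in step.
class StepSketch {
    int lastState;
    int lastAction;
    double lastReinforcement;

    // Stub: a real decide would choose between the best and a random action.
    int decide(int state) {
        return 0;
    }

    // Stub: a real learn would first update Q for the last pair.
    void learn(int newState, int newAction) {
        lastState = newState;
        lastAction = newAction;
    }

    void step(java.util.function.IntSupplier touchSensor,
              java.util.function.IntToDoubleFunction attemptAction) {
        int state = touchSensor.getAsInt();  // 1. sense the texture ahead
        int action = decide(state);          // 2. choose an action
        learn(state, action);                // 3. learn from the previous step
        // 4. act, then remember the reinforcement for the next call to learn
        lastReinforcement = attemptAction.applyAsDouble(action);
    }
}
```

Because the reinforcement is stored only after learn runs, each call to learn uses the reinforcement earned by the previous action, which matches the "last reinforcement" wording in the handout.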
6. Test your critters in the safe world, which has 15 plants, 10 rocks, and 2 critters in it.
Using the reinforcement parameters in that file, your critters should get a lot smarter, though probably not enough to survive indefinitely in that world.
You can see how they are doing by calling the Sensor method getStrength, which returns the current strength of the critter (remember that this starts at 3000).
In case you want it, you also have access to the total number of time steps that have elapsed since the program started in the Sensor method getStep.