Homework 3: reinforcement learning

Due Sunday, 17 October, 23:59.

General instructions

In this assignment you will implement reinforcement learning in the critters. You will work with both the Brain and Memory classes. You can submit the two files separately.

Go to the Vincent page for the class to upload the files.

Specific instructions

Start by downloading the version of the program that you will use for this assignment.

As in the last assignment, the critters always either move, turn, or eat, and they have access only to the touch sensor, which reports the texture of the cell in front of them. Since we won't deal with worlds containing predators, there are four possible textures (see the constants in Sensor). One difference is that the critters start out with a much higher strength than before (3000), so they won't die before they learn how to survive in the world. (They will probably die eventually anyway, but at least you'll get a chance to see them get somewhat smarter first.)

Each Brain has a Memory (this part is already done), which maintains a two-dimensional array of doubles for the Q values that the critter is learning. The method that controls all of this is step() in Brain. Its comments describe what it will do once you have finished the assignment. The following basic processes are supposed to happen in step():

  1. The brain gets a current state.
  2. On the basis of this state and the critter's age, it decides on an action, using the method decide, which you will write.
  3. It calls learn (a Memory method), which updates the Q value for the last state-action pair using the last reinforcement, and then updates its short-term memory for state and action to be the new pair.
  4. Next the selected action is attempted, and the resulting reinforcement is passed to the memory so that it can update its short-term memory for this.
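
The ordering of these four steps can be sketched as follows. This is only an illustration: the class StepOrderSketch, its method names (senseState, attempt, remember), and the constant return values are hypothetical stand-ins that record the order of calls. They are not the actual Brain, Memory, or Sensor code from the assignment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that only logs the order of calls in one time step.
// The real Brain, Memory, and Sensor classes in the course code will differ.
final class StepOrderSketch {
    private final List<String> log = new ArrayList<>();

    // 1. Get the current state (here a fixed, made-up texture index).
    int senseState() { log.add("sense"); return 1; }

    // 2. Choose an action from the state and the critter's age (made-up result).
    int decide(int state, int age) { log.add("decide"); return 2; }

    // 3. Update the Q value for the LAST state-action pair, then remember the new pair.
    void learn(int state, int action) { log.add("learn"); }

    // 4a. Attempt the action and receive a reinforcement (made-up value).
    double attempt(int action) { log.add("act"); return -0.5; }

    // 4b. Store the reinforcement so the next call to learn can use it.
    void remember(double reinforcement) { log.add("store reinforcement"); }

    List<String> step(int age) {
        log.clear();
        int state = senseState();
        int action = decide(state, age);
        learn(state, action);
        double reinforcement = attempt(action);
        remember(reinforcement);
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        System.out.println(new StepOrderSketch().step(0));
    }
}
```

Note that learn runs before the new action is attempted: the update for the previous state-action pair needs the new state, while the reinforcement for the new action is not known until after it has been tried.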

As you do each of the following, make sure that each works before going on to the next. Remember to recompile after every small change you make.

  1. In Memory, create the three "short-term memory" instance variables you will need to keep track of the last state, last action, and last reinforcement, and write the basic procedures that you will need to initialize, access, update, and print out the array of Q values: initQ, getQ, setQ, and printQ. For printQ, notice that Memory has a variable for the critter's effector, so you can call say from this class. You may also want to use the String arrays that hold the names of actions and textures, Brain.ACTIONS and Sensor.TEXTURES. Here is what my printQ method prints:
    Memory for critter 1
      State EMPTY
        Action Move: -0.5516985999092149
        Action Step: -0.47641345873882085
        Action Eat: -0.5771634603578751
      State HARD
        Action Move: -0.7165374999999999
        Action Step: -0.14946906909375
        Action Eat: -0.9372029062499998
      State SOFT
        Action Move: -0.25
        Action Step: -0.10097375
        Action Eat: 0.0
      State FUZZY
        Action Move: 0.0
        Action Step: 0.0
        Action Eat: 0.0
    
    Be sure to call initQ in the constructor for Memory.
  2. Write two methods in Memory that find the best Q value for a given state: getHighestQ, which just returns this value (you will need this for your learn method) and getBestAction, which returns the action (an integer) associated with the best Q value for the given state (you will need this for your decide method). To test these methods, you could initialize your Q value table to values other than 0.0 and then get the best Q value and best action for a given state.
  3. Write the method decide in Brain. It selects the best possible action with this probability:
    P = 1 - e^{-E a}
    In this equation, E is a constant, the Brain class variable exploitationRate, and a is the age of the critter, which you can get using the Sensor method getAge. To compute e^x, use Math.exp(x). With the remaining probability, 1 - P, decide should pick a random action instead. Put a call to decide in step at the appropriate place.
  4. Write learn in Memory. It takes as parameters a new state index and a new action index. (These values get passed to it when it is called in step in Brain.) It uses the Q-learning equation to update the Q value for the last state and action, which are stored in Memory instance variables.
    Q-learning update: Q_new(x_t, u_t) = Q_old(x_t, u_t) + η [ r + γ max_u Q_old(x_{t+1}, u) - Q_old(x_t, u_t) ]
    For r it uses the last reinforcement, which is stored in another instance variable. Recall that this is an update equation: it shows how to set the Q value for a particular state and action on the basis of the current Q value for that state and action, Q_old(x_t, u_t), the reinforcement received, and the new information just received, that is, the highest Q value for the next state (which is just the state passed to the method as a parameter). The learning rate, η, is the Memory class variable learningRate, and the discount rate, γ, is the Memory class variable discountRate. The part of the equation beginning with max uses the method that you wrote for part 2 above, getHighestQ.
    After updating the Q value for the last state and action, learn sets the last state and last action variables to be the new state and action (which were passed to the method as parameters). If you fail to get this last part right, you may end up always changing the same Q value in the table. Note that learn has three side effects: it changes the values of three instance variables in Memory, namely the Q values, the last state, and the last action.
    Add a call to learn at the appropriate place in step in Brain. For debugging, you will also want learn to show what it is doing by calling printQ, though you will probably want to comment this out later when you run the program for hundreds of steps.
  5. Complete step in Brain by actually attempting the selected action and then passing the reinforcement that is returned to the memory so that it can update its last reinforcement variable. You should be able to figure out how to do this.
  6. Test your program on the safe world, which has 15 plants, 10 rocks, and 2 critters in it. Using the reinforcement parameters in that file, your critters should get a lot smarter, though probably not enough to survive indefinitely in that world. You can see how they are doing by calling the Sensor method getStrength, which returns the current strength of the critter (remember that this starts at 3000). In case you want it, you also have access to the total number of time steps that have elapsed since the program started in the Sensor method getStep.
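
To make the arithmetic in parts 2 through 4 concrete, here is a minimal sketch under stated assumptions. The class QLearningSketch, its constructor, and its field layout are hypothetical and are not the assignment's Memory or Brain classes. The update is the standard Q-learning rule written as Q_old + η (r + γ max - Q_old), which may be arranged differently from the equation given in class.

```java
import java.util.Random;

// A minimal sketch, NOT the course's Memory or Brain: all names here are hypothetical.
final class QLearningSketch {
    final double[][] q;          // q[state][action]
    final double learningRate;   // eta
    final double discountRate;   // gamma
    int lastState = 0;
    int lastAction = 0;
    double lastReinforcement = 0.0;

    QLearningSketch(int states, int actions, double learningRate, double discountRate) {
        this.q = new double[states][actions]; // Java fills new double arrays with 0.0
        this.learningRate = learningRate;
        this.discountRate = discountRate;
    }

    // Highest Q value over all actions in the given state.
    double getHighestQ(int state) {
        double best = q[state][0];
        for (int a = 1; a < q[state].length; a++) {
            best = Math.max(best, q[state][a]);
        }
        return best;
    }

    // Index of the action with the highest Q value (first one wins on ties).
    int getBestAction(int state) {
        int best = 0;
        for (int a = 1; a < q[state].length; a++) {
            if (q[state][a] > q[state][best]) {
                best = a;
            }
        }
        return best;
    }

    // P = 1 - e^{-E a}: 0 at age 0, approaching 1 as the critter ages.
    static double exploitProbability(double exploitationRate, int age) {
        return 1.0 - Math.exp(-exploitationRate * age);
    }

    // Exploit the best known action with probability P, otherwise act randomly.
    int decide(int state, int age, double exploitationRate, Random rng) {
        if (rng.nextDouble() < exploitProbability(exploitationRate, age)) {
            return getBestAction(state);
        }
        return rng.nextInt(q[state].length);
    }

    // Update Q for the LAST pair, then make the new pair the last pair.
    void learn(int newState, int newAction) {
        double old = q[lastState][lastAction];
        double target = lastReinforcement + discountRate * getHighestQ(newState);
        q[lastState][lastAction] = old + learningRate * (target - old);
        lastState = newState;
        lastAction = newAction;
    }
}
```

For example, with η = 0.5 and γ = 0.9, an old Q value of 0.0, a last reinforcement of -0.5, and a highest Q value of 1.0 in the new state, the target is -0.5 + 0.9 × 1.0 = 0.4 and the updated Q value is 0.0 + 0.5 × (0.4 - 0.0) = 0.2.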
