Introduction to reinforcement learning - hands-on

Come to our RL reading group on Wednesdays, 3-4 PM at Soltani Lab!
Bi-weekly, this week Oct 25
Table of Contents
I. Concepts: Reward, action, agent, environment
II. Single-state environment
    A. Rescorla-Wagner
III. Multi-state environment
    A. Model-free versus model-based
    B. TD learning (model-free)

I. Concepts: Reward, action, agent, environment

Policy: defines the probability of choosing each action (it can be based on the action values or not).
Greedy, epsilon-greedy, and softmax are three examples; they differ in how they balance exploration versus exploitation (see the sketch after this list).
A purely greedy policy can get stuck at a suboptimal option (a local optimum).
The policy may also involve no action at all (e.g., when you are learning the value of a painful stimulus without taking any action).
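For concreteness, here is a minimal MATLAB sketch of these three policies applied to a pair of hypothetical action values (the values of Q, epsilon, and beta below are made up for illustration):
% hypothetical action values for two options (illustration only)
Q = [0.3 0.7];

% greedy: always pick the currently highest-valued action
[~, a_greedy] = max(Q);

% epsilon-greedy: explore with probability epsilon, otherwise exploit
epsilon = 0.1;
if rand < epsilon
    a_eps = randi(numel(Q));          % explore: random action
else
    [~, a_eps] = max(Q);              % exploit: best action
end

% softmax: choose probabilistically; larger beta = more deterministic
beta = 3;                             % inverse temperature
p = exp(beta*Q)/sum(exp(beta*Q));     % choice probabilities
a_soft = find(rand < cumsum(p), 1);   % sample an action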
From Sutton and Barto, the key elements of an RL system are:
Policy: the agent's way of behaving; a mapping from states to (probabilities of) actions.
Reward signal: the immediate feedback from the environment; it defines the goal of the task.
Value function: the total reward the agent can expect to accumulate in the long run, starting from a given state.
Action choices: how the agent selects among the available actions (e.g., greedy, epsilon-greedy, or softmax, as above).
Model of the environment: an (optional) internal representation that predicts what the environment will do next (the next state and reward).

II. Single-state environment:

The RW model encapsulates the most basic type of value-based, error-driven learning. A running internal estimate of value at each time step is denoted V_t. V_t is updated based on an error signal, or Prediction Error (PE) -- the difference between the observed reward R_t and V_t:
PE_t = R_t - V_t
Learning proceeds gradually, as a function of the learning rate α, so that the internal model of value learns to track the long-term running average of reward. If α = 0, there is no learning. If α = 1, the agent's expected value is simply the previous reward.
This equation describes the agent's learning process:
V_{t+1} = V_t + α * PE_t = V_t + α * (R_t - V_t)
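For example, a single hypothetical update in MATLAB (numbers chosen only for illustration):
V = 0.5;             % current value estimate
R = 1;               % observed reward
alpha = 0.2;         % learning rate
PE = R - V;          % prediction error = 0.5
V = V + alpha*PE     % updated value = 0.6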
R_t can be generalized to any experience imbued with value; i.e., it could be a punishment.
This framework can be used to model learning of the values associated with a set of cues, options, or actions. It is a generic learning mechanism, and can also be used to assign value to states.
Typically, the agent has one internal value representation V for each cue, option, or action.
The example below explores this using a 2-armed bandit problem (similar to the active inference problem that we were dealing with in the previous module). In a 2-choice task (i.e., choose blue vs. choose red), the agent would typically have two value representations, one for each of the 2 options, i.e., V_blue and V_red. When the agent experiences a reward, the value associated with the chosen option is updated.
Note that in all RW models, there is no representation of future states or transitions across states. There is only experienced reward (or punishment) and assignment of credit to a linked cue/action/etc.

A. Rescorla-Wagner

Reversal learning task with the Rescorla-Wagner model:

Let's code the simulated behavior!
setting up the environment and agent parameters:
clear
% define the environment (reward probability of the blue option across trials):
option_blue=[ones(100,1)*0.8;ones(30,1)*0.2;ones(30,1)*0.8;ones(30,1)*0.2];
option_red=1-option_blue;
% define learning rate:
alpha = 0.5;
% define the inverse temperature for the softmax policy:
beta = 0;
decision making and learning:
rng('shuffle');
% initialize the value estimates for the blue and red options (learned over trials):
V_blue = 0.5*ones(length(option_blue)+1,1);
V_red = 0.5*ones(length(option_red)+1,1);

% preallocate choice probability, choice, and reward vectors:
Pr_blue = zeros(length(option_blue),1);
choice = zeros(length(option_blue),1);
R = zeros(length(option_blue),1);

for ii = 1:length(option_blue)

    % decision making (let's assume that the agent uses a softmax policy)
    Pr_blue(ii) = exp(V_blue(ii)*beta)/(exp(V_blue(ii)*beta)+exp(V_red(ii)*beta));
    if rand < Pr_blue(ii)
        choice(ii) = 1; % 1 is blue and 0 is red
    else
        choice(ii) = 0;
    end

    % reward (drawn from the chosen option's current reward probability)
    if choice(ii)==1
        R(ii) = rand < option_blue(ii);
    else
        R(ii) = rand < option_red(ii);
    end

    % learning (Rescorla-Wagner update of the chosen option only)
    if choice(ii)==1 % if the blue option has been chosen
        V_blue(ii+1) = V_blue(ii) + alpha*(R(ii)-V_blue(ii));
        V_red(ii+1) = V_red(ii);
    else
        V_blue(ii+1) = V_blue(ii);
        V_red(ii+1) = V_red(ii) + alpha*(R(ii)-V_red(ii));
    end

end
Now let's have a look at the generated behavior:
figure
plot(option_blue,'b')
ylim([0,1])
title('reversal learning task')
ylabel('Reward probability / value of the blue option')
xlabel('trial number')
hold on
plot(V_blue(1:end-1),'r') % also, do this for Pr_blue and choice
legend({'Truth (generative process)' 'Agent''s value'})
 
 

Questions to answer

  1. What do you predict the effect of increasing beta to be? Why?
  2. What do you predict the effect of increasing alpha to be? Why?
     Extend the simulation to plot the probability of choosing blue, the choices, and the rewards.
  3. What values of alpha and beta would maximize reward, approximately? Why?

III. Multi-state environment

RW provides a basic learning mechanism -- error-driven learning. Often, rewards are associated with states (s) and actions (a). We can think of states as locations on a grid, but the concept is used much more generally to refer to any set of environmental constraints or context variables. Representing states becomes useful when different states are associated with different reward contingencies; e.g., one location in a maze may contain food, and another may not.
States are also associated with the ability to transition to other, future states. This is usually represented probabilistically in terms of a state-transition matrix. States are also associated with actions. For example, in a "food-present" state you may be able to take the action "eat", but in a "food-absent" state this action is not available. V(s) represents the expected value at state s. V(s) includes an internal representation of the expected current and future reward from all future states. Future rewards are often discounted based on how many transitions (e.g., time steps) are required to obtain the reward. Thus:
V(s) = E[ R_t + γ*R_{t+1} + γ^2*R_{t+2} + ... | s_t = s ], where γ (between 0 and 1) is the discount factor.
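As a toy numerical illustration of discounting (the reward sequence and γ below are arbitrary):
gamma = 0.9;                                   % discount factor
rewards = [0 0 1];                             % rewards expected now, next step, and two steps ahead
discounted = gamma.^(0:numel(rewards)-1) .* rewards;
V_s = sum(discounted)                          % = 0.81: the delayed reward is worth less now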
Q(s, a) represents the value of taking a particular action a in a particular state s.
V (state-value) and Q (state-action value) matrices:
Policy: the probability of choosing each action in each state (state-dependent).
In Q-learning (QL) or SARSA, the policy maps the Q values onto the probability of choosing an action, e.g., with a state-dependent softmax (see the sketch below).
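A minimal sketch of this mapping for a hypothetical environment with 3 states and 2 actions (all Q values below are made up):
Q = [0.2 0.8;                        % hypothetical Q(s,a): rows = states, columns = actions
     0.5 0.5;
     0.9 0.1];
beta = 2;                            % inverse temperature
s = 1;                               % current state
p = exp(beta*Q(s,:))/sum(exp(beta*Q(s,:)));   % softmax over the actions available in state s
a = find(rand < cumsum(p), 1)        % sampled action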

A. Model-free versus model-based

What is a model? A model predicts what the environment will do in the next step!
Recall that in the RW model there is no representation of future states or of transitions across states -- only experienced reward (or punishment) and the assignment of credit to a linked cue/action/etc. This lack of an explicit representation of future states and contingencies is what makes the learning "model-free".
In model-based learning, the agent knows (or learns) a model of the environment (i.e., state-transition probabilities, expected reward probabilities, etc.), as sketched below.
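As a rough sketch of what "knowing the model" could mean, consider a hypothetical 2-state environment in which the agent stores the state-transition matrix and the expected rewards, and uses them to evaluate states by looking one step ahead (all numbers are made up):
T = [0.9 0.1;                 % T(s,s') = P(s'|s): the agent's model of state transitions
     0.2 0.8];
R = [0; 1];                   % expected reward in each state
gamma = 0.9;                  % discount factor
V = [0; 1];                   % current value estimates of the two states
% one-step lookahead using the model: expected reward plus discounted future value
V_lookahead = T*(R + gamma*V)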

B. TD learning (model-free)

The TD(0) value update can be written as:
V(s_t) ← V(s_t) + α * [ R_{t+1} + γ*V(s_{t+1}) - V(s_t) ]
The term in brackets is the TD error (the RPE): it compares the current estimate V(s_t) against a one-step-ahead target, R_{t+1} + γ*V(s_{t+1}).
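A minimal TD(0) sketch on a hypothetical 5-state chain in which only the final state is rewarded (alpha, gamma, and the number of episodes are arbitrary):
nStates = 5; alpha = 0.1; gamma = 0.9;
V = zeros(nStates,1);                     % state values, initialized to zero
for episode = 1:200
    s = 1;                                % start at the left end of the chain
    while s < nStates
        s_next = s + 1;                   % deterministic move toward the rewarded end
        R = double(s_next == nStates);    % reward of 1 only on entering the final state
        % TD error: reward + discounted value of the next state - current value
        delta = R + gamma*V(s_next) - V(s);
        V(s) = V(s) + alpha*delta;
        s = s_next;
    end
end
V                                         % values grow toward the rewarded end of the chain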
Benefits of TD learning: it learns online, from every transition, without waiting for the final outcome of an episode; it does not require a model of the environment; and it bootstraps, updating value estimates from other value estimates.
Question: How is the RPE different in TD learning versus Rescorla-Wagner?
It is actually the same concept! But how?
Comparison between SARSA and Q-learning (QL): SARSA is on-policy (its update uses the Q value of the next action the agent actually takes), whereas Q-learning is off-policy (its update uses the maximum Q value available in the next state).
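A side-by-side sketch of one update for each algorithm (all numbers and the single transition below are made up):
Q = [0.2 0.8; 0.5 0.1];             % hypothetical Q(s,a) for 2 states x 2 actions
alpha = 0.1; gamma = 0.9;
s = 1; a = 1; r = 0; s_next = 2;    % one observed transition
a_next = 2;                         % the action the policy actually picks in s_next

% SARSA (on-policy): uses the Q value of the action actually taken next
Q_sarsa = Q(s,a) + alpha*(r + gamma*Q(s_next,a_next) - Q(s,a));

% Q-learning (off-policy): uses the maximum Q value in the next state
Q_qlearning = Q(s,a) + alpha*(r + gamma*max(Q(s_next,:)) - Q(s,a));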
Deep Q-learning can be used if the state space is too large for a lookup table (the Q table is replaced by a neural network that approximates Q(s, a)).