Introduction to Machine Learning and Data Mining




Reinforcement Learning

Kyle I S Harrington / kyle@eecs.tufts.edu




Some material adapted from Rich Sutton and Andy Barto

## Reinforcement Learning ![Cover of Reinforcement Learning by Sutton and Barto](https://webdocs.cs.ualberta.ca/~sutton/book/cover.gif) [Reinforcement Learning: An Introduction by Sutton and Barto](https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html)
## Reinforcement Learning Learn actions to take using a reward signal to reinforce desired behaviors Uses: - Game playing (e.g. chess, backgammon, Go) - Control (e.g. regulating components at a factory) - Navigation (e.g. a robot vacuum cleaner)
## Agent-Environment ![Agent environment](http://kephale.github.io/TuftsCOMP135_Spring2016/Lecture13/images/suttonBarto_agentEnvironment.png) Agent transitions through states by making actions, while receiving rewards
## N-Armed Bandit ![Multi-armed bandit cartoon](http://research.microsoft.com/en-us/projects/bandits/MAB-2.jpg) Action, $a$: pull one machine's arm Reward, $r$: payoff for that particular machine *Given some observations of $a$ and $r$ pairs, what is the best action to take?*

## Action-Value

The estimated value of an action after the $t$th observed reward is

$Q_t(a) = \frac{r_1+r_2+...+r_{k_a}}{k_a}$

where $k_a$ is the number of times that $a$ has been chosen. The true action-value function will be denoted $Q^*(a)$.
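As an illustrative sketch (not part of the original slides), the sample-average estimate can be maintained by keeping a list of observed rewards per arm:

```python
from collections import defaultdict

class SampleAverageEstimator:
    """Estimate Q(a) as the mean of all rewards observed for action a."""

    def __init__(self):
        self.rewards = defaultdict(list)  # action -> list of observed rewards

    def update(self, action, reward):
        self.rewards[action].append(reward)

    def q(self, action):
        # Q_t(a) = (r_1 + ... + r_{k_a}) / k_a; 0.0 if the arm was never pulled
        obs = self.rewards[action]
        return sum(obs) / len(obs) if obs else 0.0
```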

## Policies An agent selects an action, $a$, from a state, $s$, with a policy, $\pi$. **Example policies:** - *greedy*, always take the action with the highest estimated value - *$\epsilon$-greedy*, greedy with probability $1-\epsilon$, otherwise a random action - *softmax*, select actions according to a Boltzmann distribution over estimated values A key characteristic of policies is the exploration-exploitation tradeoff.
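A minimal sketch of an $\epsilon$-greedy selection rule over bandit estimates (the `q_values` dictionary and the default `epsilon` are illustrative choices, not from the slides):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick the greedy action with probability 1 - epsilon, a random action otherwise.

    q_values: dict mapping each action to its current estimate Q(a).
    """
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit
```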
## Reinforcement Learning Variables There are a number of variables we'll be using: - $a$, action - $s$, state - $t$, timestep - $r$, reward - $\alpha$, learning rate - $\gamma \in [0,1)$, discounting rate - $V(s)$, state value function - $Q(s,a)$, state-action value function - $\pi(s)$, policy (maps to an action/probability)

## Back to Action-Value

The estimated value of an action after the $t$th observed reward is

$Q_t(a) = \frac{r_1+r_2+...+r_{k_a}}{k_a}$

where $k_a$ is the number of times that $a$ has been chosen. The true action-value function will be denoted $Q^*(a)$.

What happens after billions of observed rewards?

## Incremental Action-Value

Never fear, we can calculate $Q_t(a)$ incrementally

$Q_t(a) = \frac{r_1+r_2+...+r_{k_a}}{k_a}$

Writing out the estimate after $k+1$ observed rewards:

$Q_{k+1} = \frac{1}{k+1} \displaystyle \sum_{i=1}^{k+1} r_i$

$Q_{k+1} = \frac{1}{k+1} \left( r_{k+1} + \displaystyle \sum_{i=1}^{k} r_i \right) = \frac{1}{k+1} \left( r_{k+1} + k Q_k \right)$

which rearranges to

$Q_{k+1} = Q_k + \frac{1}{k+1} [ r_{k+1} - Q_k ]$
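In code, this incremental form needs only the current estimate and the count of rewards seen so far; a minimal sketch (names are illustrative):

```python
def incremental_update(q_k, k, reward):
    """Return Q_{k+1} = Q_k + (r_{k+1} - Q_k) / (k + 1).

    q_k:    current estimate, the average of the first k rewards
    k:      number of rewards already averaged into q_k
    reward: the newly observed reward r_{k+1}
    """
    return q_k + (reward - q_k) / (k + 1)
```

Running this over a stream of rewards reproduces the sample average without storing any past rewards.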


## Incremental Updates

A general form of the incremental update rule

$\textit{NewEstimate} \leftarrow \textit{OldEstimate} + \textit{StepSize} \, [\textit{Target} - \textit{OldEstimate}]$
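A one-line sketch of this general rule, with a constant step size $\alpha$ chosen purely for illustration:

```python
def incremental_estimate(old_estimate, target, alpha=0.1):
    """NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + alpha * (target - old_estimate)
```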

## Markov Property In general, the current state and reward depend on the entire sequence of observations ![State-reward depending on all previous states](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/numeqtmp13.png) In a problem with the *Markov* property, the current state and reward depend only on the immediately preceding state and action ![State-reward depending on immediately previous state only](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/numeqtmp14.png)
## Markov Property Consider the task of balancing a pole ![Pole balancing](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/figtmp8.png)
## Markov Decision Processes RL problems that satisfy the Markov property are called Markov decision processes (MDPs) If the action and state spaces are finite, then it is a finite MDP, which is defined by its transition probabilities ![Transition probabilities](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/numeqtmp15.png) and expected reward function ![Expected reward](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/numeqtmp16.png)
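One possible (purely illustrative) way to hold a small finite MDP in Python is with nested dictionaries for the transition probabilities $P^a_{ss'}$ and expected rewards $R^a_{ss'}$; this made-up two-state, two-action MDP is reused in the sketches below:

```python
# P[s][a] is a list of (next_state, probability) pairs.
P = {
    "s0": {"stay": [("s0", 1.0)],
           "go":   [("s1", 0.9), ("s0", 0.1)]},
    "s1": {"stay": [("s1", 1.0)],
           "go":   [("s0", 1.0)]},
}

# R[(s, a, s2)] is the expected reward for that transition.
R = {
    ("s0", "stay", "s0"): 0.0,
    ("s0", "go",   "s1"): 1.0,
    ("s0", "go",   "s0"): 0.0,
    ("s1", "stay", "s1"): 2.0,
    ("s1", "go",   "s0"): 0.0,
}
```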
## Value Functions Given that a policy, $\pi$, maps from a state-action pair to a probability of taking an action $\pi(s,a)$, the state-value function *under* policy $\pi$ can be written as ![State-value function](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/numeqtmp17.png) and the action-value function can be written as ![Action-value function](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/numeqtmp18.png) where $\gamma \in [0,1)$ is the discounting factor
## Optimal Value Functions Value functions represent the amount of reward acquired by a policy over the long term. We can express the optimal value function as the value function for the best policy $V^*(s) = \max_{\pi} V^{\pi}(s)$
## Bellman Equation Let's look at the Bellman equation to ensure that our formulations are self-consistent [Bellman optimality equation](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node35.html)
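For reference, in the notation of the linked book, the Bellman optimality equation for the state-value function is

$V^*(s) = \displaystyle \max_{a} \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$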
## Iterative Policy Evaluation A policy must be evaluated to quantify its behavior via its value function. ![Iterative policy evaluation](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/pseudotmp0.png) Algorithm for calculating $V(s)$ for a policy $\pi$
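A minimal Python sketch of iterative policy evaluation for the dictionary-style MDP above, assuming a deterministic policy `pi` mapping states to actions (an illustration, not the book's code):

```python
def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-6):
    """Iteratively compute V(s) for a fixed deterministic policy pi."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            a = pi[s]
            # One-step lookahead under pi, sweeping states in place
            v_new = sum(prob * (R[(s, a, s2)] + gamma * V[s2])
                        for s2, prob in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:   # stop when the largest change is below the threshold
            return V
```

For example, `policy_evaluation(P, R, {"s0": "go", "s1": "stay"})` returns the value of that fixed policy on the toy MDP.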
## Iterative Policy Evaluation Consider a 4x4 gridworld example ![Gridworld example](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/imgtmp4.png) Actions are: left, right, up, and down
## Iterative Policy Evaluation ![Gridworld example](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/figtmp15.png)
## Policy Iteration The purpose of computing the value function is so we can find better policies. We are looking for a new policy, $\pi'$, such that $V^{\pi'}(s) \geq V^{\pi}(s)$ How can we achieve this?
## Policy Iteration ![Policy iteration](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/pseudotmp1.png)
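A sketch of the greedy improvement step and the full policy-iteration loop for the same toy MDP, reusing the `policy_evaluation` sketch above (illustrative names and structure):

```python
def policy_improvement(P, R, V, gamma=0.9):
    """Return a policy that is greedy with respect to V (one-step lookahead)."""
    def q(s, a):
        return sum(prob * (R[(s, a, s2)] + gamma * V[s2]) for s2, prob in P[s][a])
    return {s: max(P[s], key=lambda a: q(s, a)) for s in P}

def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and improvement until the policy stops changing."""
    pi = {s: next(iter(P[s])) for s in P}   # arbitrary initial policy
    while True:
        V = policy_evaluation(P, R, pi, gamma)
        new_pi = policy_improvement(P, R, V, gamma)
        if new_pi == pi:                    # policy stable
            return pi, V
        pi = new_pi
```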

## Temporal-Difference Learning

For an episodic or continuing task, we use the following formulation of long-term reward to update the value function

We express the sequence of rewards as the "return"

$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t} r_{T}$

where $T$ is the terminal timestep

Using this, we can write the update to the value function as

## Temporal-Difference Learning We know that the value of the next state encodes the expected return from that state onward. This suggests that we can learn from the difference between the values of successive states: ![TD update](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/numeqtmp29.png) This is the temporal-difference (TD) update rule. Similar behavior has been observed in dopamine neurons (Schultz, 1998)
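As a sketch, the TD(0) update for the state-value function is one line of Python (variable names are illustrative; a terminal next state would be treated as having value zero):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```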
## Temporal-Difference Learning We can also write a TD update rule for $Q(s,a)$ ![SARSA update](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/numeqtmp30.png) Can we design an on-policy learning algorithm (i.e., an algorithm that learns the value of the same policy it uses to select actions)?
## SARSA The creatively named SARSA (state-action-reward-state-action) algorithm is an on-policy TD algorithm ![SARSA code](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/pseudotmp8.png) The off-policy alternative to this algorithm is called Q-learning
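A minimal SARSA episode loop, assuming a hypothetical `env` with `reset()` and `step(a)` methods returning `(next_state, reward, done)`, and the `epsilon_greedy` helper sketched earlier (both are assumptions made for illustration):

```python
def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Run one on-policy SARSA episode, updating Q[(s, a)] in place."""
    s = env.reset()
    a = epsilon_greedy({act: Q[(s, act)] for act in actions}, epsilon)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = epsilon_greedy({act: Q[(s2, act)] for act in actions}, epsilon)
        target = r if done else r + gamma * Q[(s2, a2)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s2, a2
```

Here `Q` can be a `collections.defaultdict(float)` so that unvisited state-action pairs default to zero.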
## Eligibility Traces TD learning only propagates reward information back from the immediately subsequent state, so learning can be slow. ![Backups](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/figtmp36.png) How can we resolve this?
## Eligibility Traces Eligibility traces are an additional variable associated with each state, encoding the time-since-last-visit as ![eligibility trace math](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/numeqtmp36.png) which may be visualized as ![Eligibility plot](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/imgtmp15.png)
## Eligibility Traces We can adapt our learning algorithm to use these traces as ![TD lambda](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/pseudotmp11.png)
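A sketch of a single TD($\lambda$) step with accumulating eligibility traces, following the structure above (names are illustrative; `e` is a per-state trace, e.g. a `defaultdict(float)`):

```python
def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.9):
    """One TD(lambda) update with accumulating traces, applied to every state."""
    delta = r + gamma * V[s_next] - V[s]   # TD error for the current transition
    e[s] += 1.0                            # bump the trace for the visited state
    for state in V:
        V[state] += alpha * delta * e[state]
        e[state] *= gamma * lam            # decay all traces toward zero
```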
## Function Approximation What about RL in a situation like the game of Go? There are $3^{19 \times 19}$ possible configurations of a Go board. Can we continue to use a lookup table for our functions? [AlphaGo Slides](http://kephale.github.io/TuftsCOMP135_Spring2016/Lecture13/#/5/6)
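When the state space is too large for a table, the value function can be approximated; a minimal sketch of a linear approximator with a semi-gradient TD(0)-style weight update over hand-built features (an illustration only, not AlphaGo's deep-network approach):

```python
import numpy as np

def v_hat(w, features):
    """Linear value estimate: V(s) is approximated by w . phi(s)."""
    return float(np.dot(w, features))

def linear_td0_update(w, phi_s, r, phi_s_next, alpha=0.01, gamma=0.9):
    """Return updated weights after one semi-gradient TD(0) step."""
    delta = r + gamma * v_hat(w, phi_s_next) - v_hat(w, phi_s)
    return w + alpha * delta * phi_s
```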
## What Next? Game Theory and Retrospective