Kyle I S Harrington / kyle@eecs.tufts.edu
2 players, 19 by 19 board
$10^{761}$ possible games (# chess games $\leq 40$ moves $\approx 10^{43}$)
Goal: encircle opponent's pieces to claim territory
Uses convolutional neural networks for processing/representing the game
Trained with expert data and self-play using reinforcement learning
Day 2 of competition - start
What are the machine learning questions in gameplay?
Game AI and reinforcement learning use an agent-centric terminology
In 2 player, zero-sum games both players want to win
From some state of the game, we can predict a sequence of alternating actions where each player chooses the move that is best for them
Image from Maschelos at English Wikipedia
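As a minimal illustration of that alternating lookahead, here is a plain minimax sketch; the `game` object (`legal_moves`, `apply`, `score`) is a hypothetical interface, not any particular Go library:

```python
# Minimax sketch for a 2-player zero-sum game.
# The `game` interface (legal_moves, apply, score) is assumed for illustration.
def minimax(state, maximizing, game):
    moves = game.legal_moves(state)
    if not moves:                        # terminal state: score from the maximizer's view
        return game.score(state)
    values = [minimax(game.apply(state, m), not maximizing, game) for m in moves]
    # Players alternate: one picks the value best for themselves, the other the worst for the opponent.
    return max(values) if maximizing else min(values)
```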
MCTS (Monte Carlo Tree Search) is a randomized algorithm for sampling possible game outcomes
Idea: Simulate game play from a relevant board state and store outcome. Do this many times.
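A minimal sketch of the random-playout step only (full MCTS also maintains a search tree with selection, expansion, and backpropagation of statistics); the game interface here is assumed:

```python
import random

def estimate_win_rate(state, player, game, n_simulations=1000):
    """Estimate the value of `state` by playing many random games to completion."""
    wins = 0
    for _ in range(n_simulations):
        s = state
        while not game.is_over(s):                       # play random moves until the game ends
            s = game.apply(s, random.choice(game.legal_moves(s)))
        if game.winner(s) == player:
            wins += 1
    return wins / n_simulations                          # fraction of simulated games won
```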
But what if the game tree is huge?
There are about $10^{761}$ possible paths in the Go game tree
A Go board is basically an image (2D pixels with value: empty/black/white)
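For example, one simple encoding (an illustrative assumption, not AlphaGo's exact input features) stores the board as a 19x19 array and splits it into one binary plane per value:

```python
import numpy as np

board = np.zeros((19, 19), dtype=np.int8)   # 0 = empty, 1 = black, 2 = white
board[3, 3] = 1                             # a black stone
board[15, 16] = 2                           # a white stone

# One binary "image plane" per value, stacked like color channels for a CNN
planes = np.stack([(board == v).astype(np.float32) for v in (0, 1, 2)])
print(planes.shape)                         # (3, 19, 19)
```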
Input neurons in CNNs have "receptive fields" that cover patches of an input
Image from UFLDL, Stanford
Convolution is the process of taking a kernel, sliding it over an input image, and computing an inner product at each position (sketched below)
Image from UFLDL, Stanford
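A naive sketch of that sliding inner product (deep learning libraries actually compute cross-correlation and use much faster implementations):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image`, taking an inner product at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(19, 19)
kernel = np.random.rand(3, 3)
print(conv2d(image, kernel).shape)   # (17, 17)
```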
Pooling is an aggregation over a pool of units/neurons
Image from UFLDL, Stanford
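For instance, a small sketch of 2x2 max pooling, which keeps the largest value in each non-overlapping pool of units:

```python
import numpy as np

def max_pool2x2(x):
    """Aggregate each non-overlapping 2x2 block of `x` into its maximum."""
    h, w = (x.shape[0] // 2) * 2, (x.shape[1] // 2) * 2
    blocks = x[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2x2(feature_map))   # [[ 5.  7.] [13. 15.]]
```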
The softmax (generalized logistic) function squashes a K-dimensional vector of values into a probability distribution
Choose action $a$ with probability:
$\frac{e^{Q_t(a) / \tau}}{\sum^K_{b=1} e^{Q_t(b)/ \tau}}$
where $Q_t(a)$ is the value of action $a$ at time $t$ and $\tau$ is a temperature parameter
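A small sketch of this softmax action selection (subtracting the maximum is only for numerical stability and does not change the probabilities):

```python
import numpy as np

def softmax_policy(Q, tau=1.0):
    """Turn action values Q_t(a) into selection probabilities with temperature tau."""
    z = np.exp((Q - np.max(Q)) / tau)        # subtracting max(Q) avoids overflow
    return z / z.sum()

Q = np.array([1.0, 2.0, 0.5])                # current value estimates for K = 3 actions
probs = softmax_policy(Q, tau=0.5)
action = np.random.choice(len(Q), p=probs)   # sample action a with probability probs[a]
```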
Standard neural network training methods (e.g., backpropagation with gradient descent) apply
Image from Silver et al, 2016
The agent transitions through states by taking actions, receiving rewards along the way
Image from the RL book by Sutton and Barto
In RL, agents attempt to maximize reward obtained in the long-term
Rewards can be described as a summed sequence:
$R_t = r_{t} + r_{t+1} + r_{t+2} + ... + r_{T}$
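As a tiny worked example of this sum (using the same undiscounted convention as the formula above):

```python
def undiscounted_return(rewards, t):
    """R_t = r_t + r_{t+1} + ... + r_T for a list of per-step rewards."""
    return sum(rewards[t:])

# e.g. a game with no intermediate reward and a win at the final step
print(undiscounted_return([0, 0, 0, 1], t=0))   # 1
```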
The core of most RL algorithms is to estimate a value function:
$V^{\pi} (s) = E_{\pi} \{ R_t | s_t = s \}$
Learning the value function $V^{\pi}$ is accomplished by trial-and-error and reinforcement via reward signal
Dynamic programming is used to do this
We'll cover specifics in the RL lectures
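As one concrete illustration in the meantime (a tabular Monte Carlo-style sketch; `env.run_episode` is a hypothetical interface, and this is not necessarily the algorithm those lectures will focus on):

```python
def mc_value_estimate(env, policy, n_episodes=1000, alpha=0.1):
    """Estimate V(s) by nudging it toward the returns observed after visiting s."""
    V = {}                                         # tabular value estimates
    for _ in range(n_episodes):
        episode = env.run_episode(policy)          # list of (state, reward) pairs
        G = 0.0
        for state, reward in reversed(episode):    # accumulate R_t backwards through time
            G += reward
            v = V.get(state, 0.0)
            V[state] = v + alpha * (G - v)         # move V(s) toward the observed return
    return V
```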
Take-home exam, due March 29
Will cover everything from kNN to clustering and Gaussian mixture models (next Tuesday's class)
More unsupervised learning