Kyle I S Harrington / kyle@eecs.tufts.edu
Some slides adapted from Tom Mitchell
$P(h|D) = \frac{ P(D|h) P(h) }{ P(D) }$
$P(h) =$ prior probability of hypothesis $h$
$P(D) =$ prior probability of training data $D$
$P(D|h) =$ probability of $D$ given $h$: describes how likely the data is given the hypothesis
$P(h|D) =$ posterior probability of $h$ given $D$: describes how confident we are in $h$ after seeing our training data $D$
Why is this exciting?
Testing marbles during the gold rush
$P(gold) = 0.01$ and $P(junk) = 0.99$
Prof. Khardon's gold detector, $D$:
$P(D_{yes}|gold) = 0.98$ $P(D_{yes}|junk) = 0.04$
If a marble tests $D_{yes}$, should we buy it?
Compare the posteriors via Bayes theorem; the normalizer $P(D_{yes})$ is the same for both hypotheses:
$P(gold|D_{yes}) \propto P(gold) P(D_{yes}|gold) = 0.01 \times 0.98 = 0.0098$
$P(junk|D_{yes}) \propto P(junk) P(D_{yes}|junk) = 0.99 \times 0.04 = 0.0396$
Normalizing: $P(gold|D_{yes}) = \frac{0.0098}{0.0098 + 0.0396} \approx 0.198$ and $P(junk|D_{yes}) \approx 0.802$
So even a marble that tests $D_{yes}$ is far more likely to be junk: don't buy it.
Adapted from Roni Khardon
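As a sanity check, here is a minimal Python sketch of the same computation; all numbers come from the slides above.

```python
# Priors and detector likelihoods from the slides
p_gold, p_junk = 0.01, 0.99
p_yes_given_gold, p_yes_given_junk = 0.98, 0.04

# Unnormalized posteriors: P(h) * P(D_yes | h)
gold_score = p_gold * p_yes_given_gold  # 0.0098
junk_score = p_junk * p_yes_given_junk  # 0.0396

# Normalize by P(D_yes), the sum of both scores
p_yes = gold_score + junk_score
print(round(gold_score / p_yes, 3))  # 0.198
print(round(junk_score / p_yes, 3))  # 0.802
```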
How can we use Bayes theorem to search a hypothesis space?
$P(h|D) = \frac{ P(D|h) P(h) }{ P(D) }$
A hypothesis that maximizes $P(h|D)$
$h_{MAP} \equiv \displaystyle argmax_{h \in H} P(h|D)$
$= \displaystyle argmax_{h \in H} \frac{P(D|h)P(h)}{P(D)}$
$= \displaystyle argmax_{h \in H} P(D|h) P(h)$, since $P(D)$ is constant across hypotheses
$H$ is our hypothesis space
Brute force algorithm:
For each $h \in H$, calculate posterior probability:
$P(h|D) = \frac{P(D|h) P(h)}{P(D)}$
Then,
$h_{MAP} = argmax_{h \in H} P(h|D)$
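A minimal sketch of this brute-force algorithm, assuming the hypothesis space fits in memory; `hypotheses` (a list of (hypothesis, prior) pairs) and `likelihood` (a function computing $P(D|h)$) are hypothetical stand-ins, not part of the slides.

```python
def brute_force_map(hypotheses, likelihood, data):
    """Return h_MAP by scoring every hypothesis in the space.

    hypotheses: iterable of (h, prior) pairs, i.e. P(h) for each h
    likelihood: function computing P(D | h)
    """
    # P(D) is the same for every h, so the unnormalized
    # posterior P(D|h) * P(h) suffices for the argmax.
    scores = {h: likelihood(data, h) * prior for h, prior in hypotheses}
    return max(scores, key=scores.get)
```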
We need to specify some values:
$P(h)$ and $P(D|h)$
Let's choose $P(h)$ and $P(D|h)$ based on the following assumptions:
$D$ is noise-free
Target concept $c$ is contained in $H$
We have no a priori reason to favor any hypothesis
Let's start with $P(h)$
$c$ is contained in $H \implies \displaystyle \sum_{h \in H} P(h) = 1$
We have no a priori reason to favor any hypothesis
$P(h) = \frac{1}{|H|}$ for all $h \in H$
Now $P(D|h)$
$D$ is noise-free
The probability of observing class $d_i$ given $h$ is $1$ if $d_i = h(x_i)$, where $x_i$ is a tuple of attribute-values, and $0$ if $d_i \neq h(x_i)$
So $P(D|h) = 1$ if $h$ is consistent with every example in $D$, and $P(D|h) = 0$ otherwise
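Putting the two choices together (this follows Mitchell's analysis of brute-force MAP learning): since $P(D) = \sum_{h \in H} P(D|h) P(h) = \frac{|VS_{H,D}|}{|H|}$, a hypothesis $h$ consistent with $D$ has posterior
$P(h|D) = \frac{1 \cdot \frac{1}{|H|}}{P(D)} = \frac{1}{|VS_{H,D}|}$
where $VS_{H,D}$ is the version space, the subset of $H$ consistent with $D$; inconsistent hypotheses get $P(h|D) = 0$. Under these assumptions, every consistent hypothesis is equally probable a posteriori.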
Naive Bayes is used to learn a function that maps a tuple of attribute values $(a_1, a_2, ... a_n)$ to a finite set of outputs $V$
Classify new instances to the most probable (MAP) class/value given the attributes
$v_{MAP} = argmax_{v_j \in V} P(v_j | a_1, a_2, ... a_n )$
Rewrite according to Bayes theorem:
$v_{MAP} = argmax_{v_j \in V} \frac{ P(a_1, a_2, ... a_n | v_j ) P(v_j) }{ P(a_1, a_2, ... a_n ) }$
$v_{MAP} = argmax_{v_j \in V} P(a_1, a_2, ... a_n | v_j ) P(v_j)$, since the denominator does not depend on $v_j$
How do we get $P(a_1, a_2, ... a_n | v_j )$ and $P(v_j)$?
$P(v_j)$ is straightforward
$P(v_j) = \frac{ |v_j \in D| }{ |D| }$, i.e., the fraction of examples in dataset $D$ with class $v_j$
Why is estimating $P(a_1, a_2, ... a_n | v_j )$ the same way we did for $P(v_j)$ hard?
The number of possible attribute-value combinations grows exponentially with $n$, so most combinations appear rarely or never in $D$
Estimating $P(a_1, a_2, ... a_n | v_j )$
This is where the "naive" comes in:
Naive Bayes assumes conditional independence!
Conditional independence means the probability of observing a combination of attribute-values is the product of observing each attribute-value independently
$P(a_1, a_2, ... a_n | v_j ) = \displaystyle \prod_i P( a_i | v_j )$
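For example, with the four weather attributes used in the tennis slides below, the assumption lets us write:
$P( outlook, temp, humidity, windy | v_j ) = P( outlook | v_j ) P( temp | v_j ) P( humidity | v_j ) P( windy | v_j )$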
Going back to our naive Bayes equation
$v_{MAP} = argmax_{v_j \in V} P(a_1, a_2, ... a_n | v_j ) P(v_j)$
Rewrite using our estimates:
$v_{NB} = argmax_{v_j \in V} P(v_j) \displaystyle \prod_i P( a_i | v_j )$
Should we play tennis today?
Today's instance: ( Outlook = rainy, Temperature = cool, Humidity = high, Windy = false )
$v_{NB} = argmax_{v_j \in V} P(v_j) \displaystyle \prod_i P( a_i | v_j )$
First, the class priors $P(v_j)$:
$P( play_{yes} ) = \frac{9}{14} = 0.64$
$P( play_{no} ) = \frac{5}{14} = 0.36$
Next, the conditional probabilities $P( a_i | v_j )$ for today's attribute values:
$P( outlook_{rainy} | play_{yes} ) = \frac{3}{9} = 0.33$ ; $P( outlook_{rainy} | play_{no} ) = \frac{2}{5} = 0.4$
$P( temp_{cool} | play_{yes} ) = \frac{3}{9} = 0.33$ ; $P( temp_{cool} | play_{no} ) = \frac{1}{5} = 0.2$
$P( humidity_{high} | play_{yes} ) = \frac{3}{9} = 0.33$ ; $P( humidity_{high} | play_{no} ) = \frac{4}{5} = 0.8$
$P( windy_{false} | play_{yes} ) = \frac{6}{9} = 0.67$ ; $P( windy_{false} | play_{no} ) = \frac{2}{5} = 0.4$
Combining them:
$play_{yes}: 0.64 \times 0.33 \times 0.33 \times 0.33 \times 0.67 = 0.01541$
$play_{no}: 0.36 \times 0.4 \times 0.2 \times 0.8 \times 0.4 = 0.00922$
$0.01541 > 0.00922$, so $v_{NB} = play_{yes}$: we play tennis today.
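A minimal Python sketch that reproduces this decision, using the rounded probabilities from the slides:

```python
import math

# Rounded probabilities from the slides, for today's instance:
# ( Outlook = rainy, Temperature = cool, Humidity = high, Windy = false )
priors = {"yes": 0.64, "no": 0.36}          # P(play)
cond = {"yes": [0.33, 0.33, 0.33, 0.67],    # P(a_i | play_yes)
        "no":  [0.40, 0.20, 0.80, 0.40]}    # P(a_i | play_no)

# v_NB = argmax_v P(v) * prod_i P(a_i | v)
scores = {v: priors[v] * math.prod(cond[v]) for v in priors}
print(scores)                       # {'yes': ~0.01541, 'no': ~0.00922}
print(max(scores, key=scores.get))  # 'yes' -> we play tennis
```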
Speech and language modeling: What is the most likely word to follow another?
Image analysis: What objects are near other objects?
Medical diagnostics: What patient outcomes are most probable given the result of some tests?
Proposal due: March 7
Study a novel dataset with an advanced algorithm
Extend an ML algorithm
Do a comparative study of multiple algorithms
Due: April 25
Turn in a write-up (8-12 pages)
Should have at least 10 references
If multiple people, then more work is expected
Measuring the success of an ML algorithm