Introduction to Machine Learning and Data Mining

Kyle I S Harrington / kyle@eecs.tufts.edu







Some slides adapted from Roni Khardon and Tom Mitchell

How accurate is a hypothesis?

If we believe something about our data,

how do we express the accuracy of this belief?

How accurate is a hypothesis?

Previous morning | Previous day | Previous night | Early morning | Closed?
Light            | Light        | Light          | Heavy         | TRUE
Light            | Light        | Heavy          | Light         | TRUE
Heavy            | Heavy        | Light          | Light         | FALSE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Medium         | Medium        | TRUE
Light            | Light        | Heavy          | Heavy         | TRUE
Light            | Heavy        | Heavy          | Medium        | TRUE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Light          | Light         | FALSE
Light            | Light        | Light          | Light         | FALSE
Light            | Light        | Medium         | Light         | FALSE
Light            | Light        | Light          | Medium        | TRUE

How accurate is this decision tree?

How accurate is a hypothesis?

Previous morning | Previous day | Previous night | Early morning | Closed?
Light            | Light        | Light          | Heavy         | TRUE
Light            | Light        | Heavy          | Light         | TRUE
Heavy            | Heavy        | Light          | Light         | FALSE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Medium         | Medium        | TRUE
Light            | Light        | Heavy          | Heavy         | TRUE
Light            | Heavy        | Heavy          | Medium        | TRUE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Light          | Light         | FALSE
Light            | Light        | Light          | Light         | FALSE
Light            | Light        | Medium         | Light         | FALSE
Light            | Light        | Light          | Medium        | TRUE

Correct: 11, Incorrect: 1

Confusion Matrix

              Prediction
              +      -
Reality  +    TP     FN
         -    FP     TN

Accuracy

              Prediction
              +      -
Reality  +    TP     FN
         -    FP     TN

$Acc = \frac{TP+TN}{TP+TN+FP+FN}$
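For the snow-day dataset above, where the hypothesis got 11 of 12 days right:

$Acc = \frac{11}{12} \approx 0.92$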

Accuracy

$Acc = \frac{TP+TN}{TP+TN+FP+FN}$

What if we calculated a snow-day predictor over the entire academic year?

Should we optimize with respect to $Acc$?
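As a purely hypothetical illustration (these numbers are not from the dataset): suppose an academic year has 180 days, only 6 of which are snow days. A predictor that always answers "not closed" achieves

$Acc = \frac{0 + 174}{0 + 174 + 0 + 6} \approx 0.97$

while never correctly predicting a single snow day. With classes this imbalanced, accuracy alone can be misleading.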

Precision and Recall

In information retrieval, we search to satisfy some $Query$

The goal is to evaluate the quality of the positive (retrieved) results

$Precision = \frac{TP}{TP+FP}$

$Recall = \frac{TP}{TP+FN}$

These are often combined into the F-score (the harmonic mean of precision and recall)

$F = \frac{2 * P * R}{P + R}$
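A minimal sketch in Python (the counts here are hypothetical, not taken from the slides) of how precision, recall, and the F-score follow from the confusion-matrix entries:

```python
# Minimal sketch (hypothetical counts): precision, recall, and F-score
# computed directly from confusion-matrix entries.
def precision_recall_f(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Example: 6 true positives, 1 false positive, 2 false negatives.
p, r, f = precision_recall_f(tp=6, fp=1, fn=2)
print(f"Precision={p:.2f}  Recall={r:.2f}  F={f:.2f}")
```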

Sensitivity and Specificity

In medicine, we want accurate diagnoses

$Sensitivity = \frac{TP}{TP+FN}$ (same as recall)

$Specificity = \frac{TN}{TN+FP}$

How "accurate" the trues and falses are

True and False Positives

In signal detection, we want to find a signal within data

$TPrate = \frac{TP}{TP+FN} = recall$

$FPrate = \frac{FP}{TN+FP} = 1 - Specificity$
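A small sketch (again with hypothetical counts) showing how sensitivity and specificity relate to the TP and FP rates:

```python
# Sketch (hypothetical counts): sensitivity, specificity, and the ROC axes.
def rates(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)   # same as recall / TP rate
    specificity = tn / (tn + fp)
    fp_rate = fp / (tn + fp)       # equals 1 - specificity
    return sensitivity, specificity, fp_rate

print(rates(tp=6, fn=2, fp=1, tn=3))
```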

Often represented graphically as ROC (receiver operating characteristic) curves

How accurate is a hypothesis?

Previous morning | Previous day | Previous night | Early morning | Closed?
Light            | Light        | Light          | Heavy         | TRUE
Light            | Light        | Heavy          | Light         | TRUE
Heavy            | Heavy        | Light          | Light         | FALSE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Medium         | Medium        | TRUE
Light            | Light        | Heavy          | Heavy         | TRUE
Light            | Heavy        | Heavy          | Medium        | TRUE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Light          | Light         | FALSE
Light            | Light        | Light          | Light         | FALSE
Light            | Light        | Medium         | Light         | FALSE
Light            | Light        | Light          | Medium        | TRUE

Let's consider a linear regression:

$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$


We need everything to be a number, so let's find-and-replace:

Light = 0, Medium = 1, Heavy = 2

TRUE = 1, FALSE = 0
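A sketch of how weights like those on the next slide could be obtained, assuming an ordinary least-squares fit with an intercept (the exact fitting procedure is not shown in the slides):

```python
import numpy as np

# Snow-day dataset with the encoding above.
# Columns: Previous morning, Previous day, Previous night, Early morning
# Light = 0, Medium = 1, Heavy = 2; Closed: TRUE = 1, FALSE = 0
X = np.array([
    [0, 0, 0, 2], [0, 0, 2, 0], [2, 2, 0, 0], [2, 1, 1, 0],
    [1, 1, 1, 1], [0, 0, 2, 2], [0, 2, 2, 1], [2, 1, 1, 0],
    [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
])
y = np.array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1])

# Add a column of ones so the least-squares fit includes an intercept term.
X1 = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
intercept, weights = coef[0], coef[1:]
print("intercept:", intercept)
print("weights (w1..w4):", weights)
```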

How accurate is a hypothesis?

Previous morning | Previous day | Previous night | Early morning | Closed?
Light            | Light        | Light          | Heavy         | TRUE
Light            | Light        | Heavy          | Light         | TRUE
Heavy            | Heavy        | Light          | Light         | FALSE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Medium         | Medium        | TRUE
Light            | Light        | Heavy          | Heavy         | TRUE
Light            | Heavy        | Heavy          | Medium        | TRUE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Light          | Light         | FALSE
Light            | Light        | Light          | Light         | FALSE
Light            | Light        | Medium         | Light         | FALSE
Light            | Light        | Light          | Medium        | TRUE

Let's consider a linear regression:

$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$

Previous morning, $w_1 = -0.18649558$

Previous day, $w_2 = 0.07225614$

Previous night, $w_3 = 0.15659649$

Early morning, $w_4 = 0.39815622$

(The fit evidently also includes an intercept term $w_0 \approx 0.21$; it is needed to reproduce the predicted values in the worked examples below.)

How accurate is a hypothesis?

Previous morning | Previous day | Previous night | Early morning | Closed?
Light            | Light        | Light          | Heavy         | TRUE
Light            | Light        | Heavy          | Light         | TRUE
Heavy            | Heavy        | Light          | Light         | FALSE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Medium         | Medium        | TRUE
Light            | Light        | Heavy          | Heavy         | TRUE
Light            | Heavy        | Heavy          | Medium        | TRUE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Light          | Light         | FALSE
Light            | Light        | Light          | Light         | FALSE
Light            | Light        | Medium         | Light         | FALSE
Light            | Light        | Light          | Medium        | TRUE

Let's consider a linear regression:

$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$

$Closed = w_1 * Light + w_2 * Light + w_3 * Light + w_4 * Heavy$

$1.01 \approx -0.19 * 0 + 0.07 * 0 + 0.16 * 0 + 0.40 * 2 + 0.21$

(the trailing $0.21$ is the inferred intercept $w_0$)

How accurate is a hypothesis?

Previous morning | Previous day | Previous night | Early morning | Closed?
Light            | Light        | Light          | Heavy         | TRUE
Light            | Light        | Heavy          | Light         | TRUE
Heavy            | Heavy        | Light          | Light         | FALSE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Medium         | Medium        | TRUE
Light            | Light        | Heavy          | Heavy         | TRUE
Light            | Heavy        | Heavy          | Medium        | TRUE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Light          | Light         | FALSE
Light            | Light        | Light          | Light         | FALSE
Light            | Light        | Medium         | Light         | FALSE
Light            | Light        | Light          | Medium        | TRUE

Let's consider a linear regression:

$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$

$Closed = w_1 * Heavy + w_2 * Heavy + w_3 * Light + w_4 * Light$

$-0.02 \approx -0.19 * 2 + 0.07 * 2 + 0.16 * 0 + 0.40 * 0 + 0.21$

(again including the inferred intercept $w_0 \approx 0.21$)

ROC Curve

When we fit the regression, we obtained an equation of the form:

$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$

$Closed$ is a real number that we must convert into True/False by choosing a threshold
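A sketch of that conversion, assuming a cutoff of 0.5 (the slides do not state the threshold) and the intercept of roughly 0.21 inferred from the worked examples above:

```python
import numpy as np

# Sketch: threshold the regression output to get a True/False prediction.
w0 = 0.21                                             # inferred intercept
w = np.array([-0.18649558, 0.07225614, 0.15659649, 0.39815622])

def predict_closed(features, threshold=0.5):
    score = w0 + w @ np.asarray(features, dtype=float)
    return score, bool(score >= threshold)

print(predict_closed([0, 0, 0, 2]))   # Light, Light, Light, Heavy -> True
print(predict_closed([2, 2, 0, 0]))   # Heavy, Heavy, Light, Light -> False
```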

ROC Curve

$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$

Example values of $Closed$ might actually be:

Reality     | 1   | 1   | 1    | 0   | 0     | 0
Predictions | 0.8 | 0.5 | 0.45 | 0.5 | -0.35 | -0.05

ROC Curve

Reality     | 1   | 1   | 1    | 0   | 0     | 0
Predictions | 0.8 | 0.5 | 0.45 | 0.5 | -0.35 | -0.05
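A sketch of how the ROC points for these six predictions could be traced out by sweeping a decision threshold from high to low:

```python
# Sketch: compute (FP rate, TP rate) points for each candidate threshold.
reality     = [1, 1, 1, 0, 0, 0]
predictions = [0.8, 0.5, 0.45, 0.5, -0.35, -0.05]

P = sum(reality)              # number of real positives
N = len(reality) - P          # number of real negatives

for threshold in sorted(set(predictions), reverse=True):
    tp = sum(1 for y, s in zip(reality, predictions) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(reality, predictions) if s >= threshold and y == 0)
    print(f"threshold={threshold:5.2f}  TP rate={tp / P:.2f}  FP rate={fp / N:.2f}")
```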

ROC Curves

An ROC curve taken over a larger test set.

Different types of accuracy

Sample error - Error measured on a sample (subset) of the data

True error - Error over the underlying data distribution (usually cannot be measured directly)
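In symbols (standard definitions, with hypothesis $h$, target function $f$, sample $S$, and data distribution $\mathcal{D}$):

$error_S(h) = \frac{1}{|S|} \sum_{x \in S} \delta(f(x) \neq h(x))$

$error_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}[f(x) \neq h(x)]$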

Sample error

If we measured the error of the decision tree on each day of this week, we would obtain a sample error

True error

If there were no noise in deciding snow days and we had data for all combinations of the attribute values:

Previous morning, Previous day, Previous night, Early morning

then we could measure the true error.

Validation Datasets

  • Partition the dataset into training and testing
  • Train model on training set
  • Measure performance on testing set

If we repeat this $N$ times for some ML algorithm and average the performance, we might get a decent estimate of its performance
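A sketch of this repeated holdout procedure, assuming an 80/20 random split and a decision-tree learner (both are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Sketch: repeat a random train/test split N times and average test accuracy.
def repeated_holdout(X, y, n_repeats=10, test_size=0.2):
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    return np.mean(scores)

# Example usage on made-up data.
rng = np.random.default_rng(0)
X_demo, y_demo = rng.random((100, 4)), rng.integers(0, 2, size=100)
print(repeated_holdout(X_demo, y_demo))
```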

What could go wrong?

Validation Datasets

What could go wrong?

The train/test splits may overlap across runs, so the performance estimates are correlated rather than independent

Cross-Validation

  • Divide the data into $k$ subsets (called folds)
  • For each fold $i$:
    • Train on all folds except $i$
    • Test on fold $i$
  • Report the average performance over all $k$ folds

Test sets do not overlap!

Cross-Validation

Consider a dataset where $N = 100$, split into $k = 4$ folds: each fold holds 25 examples, so each run trains on 75 examples and tests on 25
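A sketch of those folds using scikit-learn's KFold; each example lands in exactly one test fold, so the test sets never overlap:

```python
import numpy as np
from sklearn.model_selection import KFold

# Sketch: N = 100 examples split into k = 4 folds.
X = np.zeros((100, 1))   # placeholder features
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: train on {len(train_idx)}, test on {len(test_idx)}")
```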

Stratified Cross-Validation

  • Separate the examples by class
  • Partition each class's examples into $k$ folds
  • Join the corresponding folds across classes
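A sketch with StratifiedKFold and a made-up 75/25 class split, showing that each test fold keeps roughly the same class ratio as the full dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((100, 1))                 # placeholder features
y = np.array([0] * 75 + [1] * 25)      # hypothetical 75/25 class split
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for fold, (_, test_idx) in enumerate(skf.split(X, y)):
    print(f"fold {fold}: {y[test_idx].sum()} positives out of {len(test_idx)} test examples")
```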

Leave One Out

k-fold cross-validation, where $k=N$

  • Effective (each model trains on nearly all of the data)
  • High variance in each per-fold estimate (each test set is a single example)
  • Expensive (requires training $N$ models)
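A minimal sketch: LeaveOneOut produces exactly $N$ folds, each testing on a single example:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Leave-one-out is k-fold CV with k = N (here N = 12, as in the snow-day dataset).
X = np.zeros((12, 1))
print("number of folds:", LeaveOneOut().get_n_splits(X))
```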

Cross-validation

Wait, why did we talk about this?

Cross-validation

Cross-validation gives a more reliable estimate of algorithm performance

We can use this to compare ML algorithms!
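A sketch of such a comparison on made-up data, scoring two arbitrarily chosen classifiers on the same cross-validation folds:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic toy data for illustration only.
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = (X[:, 3] > 0.5).astype(int)

# Use the same folds for both algorithms so the comparison is fair.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for model in (DecisionTreeClassifier(random_state=0), LogisticRegression()):
    scores = cross_val_score(model, X, y, cv=cv)
    print(type(model).__name__, round(scores.mean(), 3))
```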

What Next?

Model selection

Then features!