Introduction to Machine Learning and Data Mining

Kyle I S Harrington / kyle@eecs.tufts.edu







Some slides adapted from Roni Khardon and Tom Mitchell

How accurate is a hypothesis?

If we believe something about our data,

how do we express the accuracy of this belief?

How accurate is a hypothesis?

Previous morning | Previous day | Previous night | Early morning | Closed?
Light            | Light        | Light          | Heavy         | TRUE
Light            | Light        | Heavy          | Light         | TRUE
Heavy            | Heavy        | Light          | Light         | FALSE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Medium         | Medium        | TRUE
Light            | Light        | Heavy          | Heavy         | TRUE
Light            | Heavy        | Heavy          | Medium        | TRUE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Light          | Light         | FALSE
Light            | Light        | Light          | Light         | FALSE
Light            | Light        | Medium         | Light         | FALSE
Light            | Light        | Light          | Medium        | TRUE

How accurate is this decision tree?

How accurate is a hypothesis?

Previous morning | Previous day | Previous night | Early morning | Closed?
Light            | Light        | Light          | Heavy         | TRUE
Light            | Light        | Heavy          | Light         | TRUE
Heavy            | Heavy        | Light          | Light         | FALSE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Medium         | Medium        | TRUE
Light            | Light        | Heavy          | Heavy         | TRUE
Light            | Heavy        | Heavy          | Medium        | TRUE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Light          | Light         | FALSE
Light            | Light        | Light          | Light         | FALSE
Light            | Light        | Medium         | Light         | FALSE
Light            | Light        | Light          | Medium        | TRUE

Correct: 11, Incorrect: 1

Confusion Matrix

              Prediction
              +      -
Reality  +    TP     FN
         -    FP     TN

Accuracy

              Prediction
              +      -
Reality  +    TP     FN
         -    FP     TN

$Acc = \frac{TP+TN}{TP+TN+FP+FN}$
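For the snow-day dataset above, where the hypothesis got 11 of 12 days right:

$Acc = \frac{11}{12} \approx 0.92$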

Accuracy

$Acc = \frac{TP+TN}{TP+TN+FP+FN}$

What if we calculated a snow-day predictor over the entire academic year?

Should we optimize with respect to $Acc$?
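As a purely hypothetical illustration (these numbers are not from the dataset): suppose an academic year has 180 days, only 6 of which are snow days. A predictor that always answers "not closed" achieves

$Acc = \frac{0 + 174}{0 + 174 + 0 + 6} \approx 0.97$

while never correctly predicting a single snow day. With classes this imbalanced, accuracy alone can be misleading.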

Precision and Recall

In information retrieval, we search to satisfy some $Query$

The goal is to evaluate the quality of the positive (retrieved) results

$Precision = \frac{TP}{TP+FP}$

$Recall = \frac{TP}{TP+FN}$

These are often combined into the F-score (the harmonic mean of precision and recall)

$F = \frac{2 * P * R}{P + R}$
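A minimal sketch in Python (the counts here are hypothetical, not taken from the slides) of how precision, recall, and the F-score follow from the confusion-matrix entries:

```python
# Minimal sketch (hypothetical counts): precision, recall, and F-score
# computed directly from confusion-matrix entries.
def precision_recall_f(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Example: 6 true positives, 1 false positive, 2 false negatives.
p, r, f = precision_recall_f(tp=6, fp=1, fn=2)
print(f"Precision={p:.2f}  Recall={r:.2f}  F={f:.2f}")
```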

Sensitivity and Specificity

In medicine, we want accurate diagnoses

$Sensitivity = \frac{TP}{TP+FN}$ (same as recall)

$Specificity = \frac{TN}{TN+FP}$

How "accurate" the trues and falses are

True and False Positives

In signal detection, we want to find a signal within data

$TPrate = \frac{TP}{TP+FN} = recall$

$FPrate = \frac{FP}{TN+FP} = 1 - Specificity$
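A small sketch (again with hypothetical counts) showing how sensitivity and specificity relate to the TP and FP rates:

```python
# Sketch (hypothetical counts): sensitivity, specificity, and the ROC axes.
def rates(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)   # same as recall / TP rate
    specificity = tn / (tn + fp)
    fp_rate = fp / (tn + fp)       # equals 1 - specificity
    return sensitivity, specificity, fp_rate

print(rates(tp=6, fn=2, fp=1, tn=3))
```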

Often represented graphically as ROC (receiver operating characteristic) curves

How accurate is a hypothesis?

Previous morning | Previous day | Previous night | Early morning | Closed?
Light            | Light        | Light          | Heavy         | TRUE
Light            | Light        | Heavy          | Light         | TRUE
Heavy            | Heavy        | Light          | Light         | FALSE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Medium         | Medium        | TRUE
Light            | Light        | Heavy          | Heavy         | TRUE
Light            | Heavy        | Heavy          | Medium        | TRUE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Light          | Light         | FALSE
Light            | Light        | Light          | Light         | FALSE
Light            | Light        | Medium         | Light         | FALSE
Light            | Light        | Light          | Medium        | TRUE

Let's consider a linear regression:

$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$


We need everything to be a number, so let's find-and-replace:

Light = 0, Medium = 1, Heavy = 2

TRUE = 1, FALSE = 0
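A sketch of how weights like those on the next slide could be obtained, assuming an ordinary least-squares fit with an intercept (the exact fitting procedure is not shown in the slides):

```python
import numpy as np

# Snow-day dataset with the encoding above.
# Columns: Previous morning, Previous day, Previous night, Early morning
# Light = 0, Medium = 1, Heavy = 2; Closed: TRUE = 1, FALSE = 0
X = np.array([
    [0, 0, 0, 2], [0, 0, 2, 0], [2, 2, 0, 0], [2, 1, 1, 0],
    [1, 1, 1, 1], [0, 0, 2, 2], [0, 2, 2, 1], [2, 1, 1, 0],
    [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
])
y = np.array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1])

# Add a column of ones so the least-squares fit includes an intercept term.
X1 = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
intercept, weights = coef[0], coef[1:]
print("intercept:", intercept)
print("weights (w1..w4):", weights)
```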

How accurate is a hypothesis?

Previous morning | Previous day | Previous night | Early morning | Closed?
Light            | Light        | Light          | Heavy         | TRUE
Light            | Light        | Heavy          | Light         | TRUE
Heavy            | Heavy        | Light          | Light         | FALSE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Medium         | Medium        | TRUE
Light            | Light        | Heavy          | Heavy         | TRUE
Light            | Heavy        | Heavy          | Medium        | TRUE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Light          | Light         | FALSE
Light            | Light        | Light          | Light         | FALSE
Light            | Light        | Medium         | Light         | FALSE
Light            | Light        | Light          | Medium        | TRUE

Let's consider a linear regression:

$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$

Previous morning, $w_1 = -0.18649558$

Previous day, $w_2 = 0.07225614$

Previous night, $w_3 = 0.15659649$

Early morning, $w_4 = 0.39815622$

(The fit evidently also includes an intercept term $w_0 \approx 0.21$; it is needed to reproduce the predicted values in the worked examples below.)

How accurate is a hypothesis?

Previous morning | Previous day | Previous night | Early morning | Closed?
Light            | Light        | Light          | Heavy         | TRUE
Light            | Light        | Heavy          | Light         | TRUE
Heavy            | Heavy        | Light          | Light         | FALSE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Medium         | Medium        | TRUE
Light            | Light        | Heavy          | Heavy         | TRUE
Light            | Heavy        | Heavy          | Medium        | TRUE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Light          | Light         | FALSE
Light            | Light        | Light          | Light         | FALSE
Light            | Light        | Medium         | Light         | FALSE
Light            | Light        | Light          | Medium        | TRUE

Let's consider a linear regression:

$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$

$Closed = w_1 * Light + w_2 * Light + w_3 * Light + w_4 * Heavy$

$1.01 \approx -0.19 * 0 + 0.07 * 0 + 0.16 * 0 + 0.40 * 2 + 0.21$

(the trailing $0.21$ is the inferred intercept $w_0$)

How accurate is a hypothesis?

Previous morning | Previous day | Previous night | Early morning | Closed?
Light            | Light        | Light          | Heavy         | TRUE
Light            | Light        | Heavy          | Light         | TRUE
Heavy            | Heavy        | Light          | Light         | FALSE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Medium         | Medium        | TRUE
Light            | Light        | Heavy          | Heavy         | TRUE
Light            | Heavy        | Heavy          | Medium        | TRUE
Heavy            | Medium       | Medium         | Light         | FALSE
Medium           | Medium       | Light          | Light         | FALSE
Light            | Light        | Light          | Light         | FALSE
Light            | Light        | Medium         | Light         | FALSE
Light            | Light        | Light          | Medium        | TRUE

Let's consider a linear regression:

$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$

$Closed = w_1 * Heavy + w_2 * Heavy + w_3 * Light + w_4 * Light$

$-0.02 \approx -0.19 * 2 + 0.07 * 2 + 0.16 * 0 + 0.40 * 0 + 0.21$

(again including the inferred intercept $w_0 \approx 0.21$)

ROC Curve

When we fit the regression, we obtained an equation of the form:

$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$

$Closed$ is a real number that we must convert into True/False by choosing a threshold
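A sketch of that conversion, assuming a cutoff of 0.5 (the slides do not state the threshold) and the intercept of roughly 0.21 inferred from the worked examples above:

```python
import numpy as np

# Sketch: threshold the regression output to get a True/False prediction.
w0 = 0.21                                             # inferred intercept
w = np.array([-0.18649558, 0.07225614, 0.15659649, 0.39815622])

def predict_closed(features, threshold=0.5):
    score = w0 + w @ np.asarray(features, dtype=float)
    return score, bool(score >= threshold)

print(predict_closed([0, 0, 0, 2]))   # Light, Light, Light, Heavy -> True
print(predict_closed([2, 2, 0, 0]))   # Heavy, Heavy, Light, Light -> False
```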

ROC Curve

$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$

Example values of $Closed$ might actually be:

Reality     | 1   | 1   | 1    | 0   | 0     | 0
Predictions | 0.8 | 0.5 | 0.45 | 0.5 | -0.35 | -0.05

ROC Curve

Reality     | 1   | 1   | 1    | 0   | 0     | 0
Predictions | 0.8 | 0.5 | 0.45 | 0.5 | -0.35 | -0.05
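A sketch of how the ROC points for these six predictions could be traced out by sweeping a decision threshold from high to low:

```python
# Sketch: compute (FP rate, TP rate) points for each candidate threshold.
reality     = [1, 1, 1, 0, 0, 0]
predictions = [0.8, 0.5, 0.45, 0.5, -0.35, -0.05]

P = sum(reality)              # number of real positives
N = len(reality) - P          # number of real negatives

for threshold in sorted(set(predictions), reverse=True):
    tp = sum(1 for y, s in zip(reality, predictions) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(reality, predictions) if s >= threshold and y == 0)
    print(f"threshold={threshold:5.2f}  TP rate={tp / P:.2f}  FP rate={fp / N:.2f}")
```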

ROC Curves

An ROC curve taken over a larger test set.

Different types of accuracy

Sample error - Error measured on a sample (subset) of the data

True error - Error over the underlying data distribution (usually cannot be measured directly)
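In symbols (standard definitions, with hypothesis $h$, target function $f$, sample $S$, and data distribution $\mathcal{D}$):

$error_S(h) = \frac{1}{|S|} \sum_{x \in S} \delta(f(x) \neq h(x))$

$error_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}[f(x) \neq h(x)]$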

Sample error

If we measured the error of the decision tree on each day of this week, we would obtain a sample error

True error

If there were no noise in deciding snow days and we had data for all combinations of the attribute values:

Previous morning, Previous day, Previous night, Early morning

then we could measure the true error.

Validation Datasets

  • Partition the dataset into training and testing
  • Train model on training set
  • Measure performance on testing set

If we repeat this $N$ times for some ML algorithm and average the performance, we might get a decent estimate of its performance
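A sketch of this repeated holdout procedure, assuming an 80/20 random split and a decision-tree learner (both are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Sketch: repeat a random train/test split N times and average test accuracy.
def repeated_holdout(X, y, n_repeats=10, test_size=0.2):
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    return np.mean(scores)

# Example usage on made-up data.
rng = np.random.default_rng(0)
X_demo, y_demo = rng.random((100, 4)), rng.integers(0, 2, size=100)
print(repeated_holdout(X_demo, y_demo))
```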

What could go wrong?

Validation Datasets

What could go wrong?

The train/test splits may overlap across runs, so the performance estimates are correlated rather than independent

Cross-Validation

  • Divide the data into $k$ subsets (called folds)
  • For each fold $i$:
    • Train on all folds except $i$
    • Test on fold $i$
  • Report the average performance over all $k$ folds

Test sets do not overlap!

Cross-Validation

Consider a dataset where $N = 100$, split into $k = 4$ folds: each fold holds 25 examples, so each run trains on 75 examples and tests on 25
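A sketch of those folds using scikit-learn's KFold; each example lands in exactly one test fold, so the test sets never overlap:

```python
import numpy as np
from sklearn.model_selection import KFold

# Sketch: N = 100 examples split into k = 4 folds.
X = np.zeros((100, 1))   # placeholder features
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: train on {len(train_idx)}, test on {len(test_idx)}")
```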

Stratified Cross-Validation

  • Separate the examples by class
  • Partition each class's examples into $k$ folds
  • Join the corresponding folds across classes
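A sketch with StratifiedKFold and a made-up 75/25 class split, showing that each test fold keeps roughly the same class ratio as the full dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((100, 1))                 # placeholder features
y = np.array([0] * 75 + [1] * 25)      # hypothetical 75/25 class split
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for fold, (_, test_idx) in enumerate(skf.split(X, y)):
    print(f"fold {fold}: {y[test_idx].sum()} positives out of {len(test_idx)} test examples")
```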

Leave One Out

k-fold cross-validation, where $k=N$

  • Effective (each model trains on nearly all of the data)
  • High variance in each per-fold estimate (each test set is a single example)
  • Expensive (requires training $N$ models)
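A minimal sketch: LeaveOneOut produces exactly $N$ folds, each testing on a single example:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Leave-one-out is k-fold CV with k = N (here N = 12, as in the snow-day dataset).
X = np.zeros((12, 1))
print("number of folds:", LeaveOneOut().get_n_splits(X))
```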

Cross-validation

Wait, why did we talk about this?

Cross-validation

Cross-validation gives a more reliable estimate of algorithm performance

We can use this to compare ML algorithms!
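A sketch of such a comparison on made-up data, scoring two arbitrarily chosen classifiers on the same cross-validation folds:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic toy data for illustration only.
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = (X[:, 3] > 0.5).astype(int)

# Use the same folds for both algorithms so the comparison is fair.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for model in (DecisionTreeClassifier(random_state=0), LogisticRegression()):
    scores = cross_val_score(model, X, y, cv=cv)
    print(type(model).__name__, round(scores.mean(), 3))
```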

What Next?

Model selection

Then features!