Kyle I S Harrington / kyle@eecs.tufts.edu
Some slides adapted from Roni Khardon and Tom Mitchell
If we believe something about our data,
how do we express the accuracy of this belief?
How accurate is this decision tree?
Correct: 11, Incorrect: 1
|             | Prediction: + | Prediction: - |
|-------------|---------------|---------------|
| Reality: +  | TP            | FN            |
| Reality: -  | FP            | TN            |
$Acc = \frac{TP+TN}{TP+TN+FP+FN}$
What if we calculated a snow-day predictor's accuracy over the entire academic year?
Most days are not snow days, so a predictor that always says "not closed" would still score a very high $Acc$.
Should we optimize with respect to $Acc$?
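A minimal sketch of the issue, assuming a hypothetical 180-day academic year with only 5 snow days (these counts are made up):

```python
# Hypothetical counts: 5 snow days in a 180-day academic year.
# A trivial predictor that always says "not closed" never finds a snow day...
tp, fn = 0, 5
# ...but it gets every ordinary day right.
fp, tn = 0, 175

acc = (tp + tn) / (tp + tn + fp + fn)
print(acc)  # ~0.97, despite predicting zero snow days correctly
```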
In information retrieval, we search to satisfy some $Query$
Goal is to optimize positive query responses
$Precision = \frac{TP}{TP+FP}$
$Recall = \frac{TP}{TP+FN}$
This is often combined into the F-score
$F = \frac{2 * P * R}{P + R}$
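A minimal sketch of these metrics computed directly from confusion-matrix counts (the counts below are made up for illustration):

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

print(precision_recall_f(tp=8, fp=2, fn=4))  # (0.8, 0.667, 0.727)
```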
In medicine, we want accurate diagnoses
$Sensitivity = \frac{TP}{TP+FN}$ (same as recall)
$Specificity = \frac{TN}{TN+FP}$
How "accurate" the trues and falses are
In signal detection, we want to find a signal within data
$TPrate = \frac{TP}{TP+FN} = recall$
$FPrate = \frac{FP}{TN+FP} = 1 - Specificity$
Often represented graphically as ROC (receiver operator characteristic) curve plots
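A small sketch relating these rates back to the confusion-matrix counts (again with made-up counts):

```python
def signal_detection_rates(tp, fn, fp, tn):
    """True-positive rate, false-positive rate, and specificity."""
    tp_rate = tp / (tp + fn)        # sensitivity / recall
    specificity = tn / (tn + fp)
    fp_rate = fp / (tn + fp)        # equals 1 - specificity
    return tp_rate, fp_rate, specificity

print(signal_detection_rates(tp=8, fn=4, fp=2, tn=10))  # (0.667, 0.167, 0.833)
```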
Let's consider a linear regression: $Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$
We need everything to be a number, so let's find-and-replace:
Light = 0, Medium = 1, Heavy = 2
TRUE = 1, FALSE = 0
Let's consider a linear regression: $Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$
Previous morning, $w_1 = -0.18649558$
Previous day, $w_2 = 0.07225614$
Previous night, $w_3 = 0.15659649$
Early morning, $w_4 = 0.39815622$
Let's consider a linear regression: $Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$
$Closed = w_1 * Light + w_2 * Light + w_3 * Light + w_4 * Heavy$
$1.01 = -0.19 * 0 + 0.07 * 0 + 0.16 * 0 + 0.40 * 2$

Let's consider a linear regression: $Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$
$Closed = w_1 * Heavy + w_2 * Heavy + w_3 * Light + w_4 * Light$
$-0.02 = -0.19 * 2 + 0.07 * 2 + 0.16 * 0 + 0.40 * 0$
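A quick check of these weighted sums with NumPy; note that the reported totals (1.01 and -0.02) appear to also include an intercept of roughly 0.21, which is an assumption and not part of the equation as written:

```python
import numpy as np

# Fitted weights, rounded as above
w = np.array([-0.19, 0.07, 0.16, 0.40])

# Encoded attribute values: Light = 0, Medium = 1, Heavy = 2
x1 = np.array([0, 0, 0, 2])  # Light, Light, Light, Heavy
x2 = np.array([2, 2, 0, 0])  # Heavy, Heavy, Light, Light

print(w @ x1)  #  0.80
print(w @ x2)  # -0.24
# The slides report 1.01 and -0.02; the difference (~0.21) presumably comes
# from an intercept term that is not written in the equation (an assumption).
```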
When we did the regression, we created an equation of the form:
$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$
$Closed$ is a number that we convert into True/False
$Closed = w_1 * P_m + w_2 * P_d + w_3 * P_n + w_4 * E_m$
Example values of $Closed$ might actually be:
| Reality     | 1   | 1   | 1    | 0   | 0     | 0     |
|-------------|-----|-----|------|-----|-------|-------|
| Predictions | 0.8 | 0.5 | 0.45 | 0.5 | -0.35 | -0.05 |
An ROC curve taken over a larger test set.
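A sketch of how ROC points could be generated from the predictions above by sweeping a decision threshold; each threshold yields one (FP rate, TP rate) pair:

```python
# Reality labels and regression outputs from the table above
reality     = [1, 1, 1, 0, 0, 0]
predictions = [0.8, 0.5, 0.45, 0.5, -0.35, -0.05]

# Sweep a threshold over the predicted scores
for threshold in sorted(set(predictions), reverse=True):
    predicted = [1 if p >= threshold else 0 for p in predictions]
    tp = sum(r == 1 and p == 1 for r, p in zip(reality, predicted))
    fn = sum(r == 1 and p == 0 for r, p in zip(reality, predicted))
    fp = sum(r == 0 and p == 1 for r, p in zip(reality, predicted))
    tn = sum(r == 0 and p == 0 for r, p in zip(reality, predicted))
    print(threshold, tp / (tp + fn), fp / (tn + fp))
```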
Sample error - Error measured on a particular sample (subset) of the data
True error - Error over the entire underlying distribution of the data (usually cannot be measured directly)
If we measured error by applying the decision tree to each day this week, we would have a sample error
If there were no noise in deciding snow days and we had data for every combination of values of the attributes:
Previous morning, Previous day, Previous night, Early morning
then the sample error would equal the true error.
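A rough illustration of "every combination", assuming each of the four attributes takes one of the three precipitation levels used earlier (an assumption; the real attributes may have other values):

```python
from itertools import product

levels = ["Light", "Medium", "Heavy"]
attributes = ["Previous morning", "Previous day", "Previous night", "Early morning"]

# Every combination of attribute values: 3^4 = 81 possible days
all_combinations = list(product(levels, repeat=len(attributes)))
print(len(all_combinations))  # 81
```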
If we repeat this $N$ times for some ML algorithm (e.g., randomly splitting the data into train and test sets and measuring error on each split) and average the performance,
then we might get a decent estimate of the algorithm's performance
What could go wrong?
We might have correlation between our train/test sets across runs
With cross-validation, the test sets do not overlap!
Consider a dataset where $N=100$, split into 4 folds
k-fold cross-validation where $k=N$ is called leave-one-out cross-validation
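A minimal sketch of k-fold splitting over a hypothetical dataset of 100 examples with $k=4$ (setting $k=N$ gives leave-one-out):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n))
    fold_size = n // k
    for fold in range(k):
        test = indices[fold * fold_size:(fold + 1) * fold_size]
        train = indices[:fold * fold_size] + indices[(fold + 1) * fold_size:]
        yield train, test

# N = 100 split into 4 folds: each test set has 25 examples, and no two overlap
for train, test in k_fold_splits(100, 4):
    print(len(train), len(test))  # 75 25
```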
Wait, why did we talk about this?
Cross-validation gives a more reliable estimate of algorithm performance
We can use this to compare ML algorithms!
Model selection
Then features!