Kyle I S Harrington / kyle@eecs.tufts.edu
Some slides adapted from Roni Khardon and Tom Mitchell
Multiple models can often describe the same data.
How do we choose which one to use?
We may favor the model with lower error/higher accuracy
Even so, is the model with the lowest error on a particular set of data the one that we want?
$error_D(h) \equiv Pr_{x \in D} [f(x) \neq h(x)]$
$error_S(h) \equiv \frac{1}{n} \displaystyle \sum_{x \in S} \delta(f(x) \neq h(x))$
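As a quick illustration (not part of the original slides), a minimal Python sketch of computing $error_S(h)$, assuming the true labels $f(x)$ and the predictions $h(x)$ are available as arrays; the values in the example are made up.

```python
# Minimal sketch: error_S(h) as the fraction of misclassified examples in S,
# i.e. (1/n) * sum over x in S of delta(f(x) != h(x)).
import numpy as np

def sample_error(f_true, h_pred):
    return np.mean(np.asarray(f_true) != np.asarray(h_pred))

# Hypothetical labels and predictions: 2 mistakes out of 8 -> error_S(h) = 0.25
print(sample_error([0, 1, 1, 0, 1, 0, 0, 1],
                   [0, 1, 0, 0, 1, 1, 0, 1]))
```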
How well does $error_S(h)$ estimate $error_D(h)$?
Consider the case where $S$ is our training set:
$bias \equiv E[error_S(h)] - error_D(h)$
For any training set, we expect that bias will be negative because $h$ and $S$ are not independent
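A small simulation can illustrate this optimistic bias (the setup below is an assumption made for the example: purely random labels and random candidate hypotheses). Every candidate has true error 0.5, yet the error on $S$ of the candidate selected using $S$ averages well below 0.5.

```python
# Sketch: optimistic bias when h is chosen using the same sample S.
# Labels are coin flips, so error_D(h) = 0.5 for every candidate hypothesis;
# picking the candidate with the lowest error on S still looks far better than 0.5.
import numpy as np

rng = np.random.default_rng(0)
n, n_candidates, n_trials = 20, 50, 2000
best_train_errors = []
for _ in range(n_trials):
    y = rng.integers(0, 2, n)                            # random training labels
    candidates = rng.integers(0, 2, (n_candidates, n))   # random "hypotheses"
    best_train_errors.append(np.min(np.mean(candidates != y, axis=1)))

print(np.mean(best_train_errors))  # well below 0.5, i.e. E[error_S(h)] < error_D(h)
```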
What other complications beyond $bias$ might arise?
Even if $S$ and $h$ are independent, $error_S(h)$ may vary from $error_D(h)$
$error_S(h)$ is a random variable
If we rerun with a different randomly drawn $S$, where $|S| = n$
the number of misclassified examples $r = n \cdot error_S(h)$ follows a Binomial distribution:
$P(r) = \frac{ n! } { r! ( n - r )! } p^r ( 1 - p )^{n-r}$,
where $p = error_D(h)$
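A sketch of this Binomial calculation in Python (the values of $n$, $p$, and $r$ below are made up for the example):

```python
# Probability that h makes exactly r errors on n independent examples,
# when its true error is p = error_D(h).
from math import comb

def p_of_r_errors(r, n, p):
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 40, 0.3                                             # assumed values
print(p_of_r_errors(12, n, p))                             # P(exactly 12 errors)
print(sum(p_of_r_errors(r, n, p) for r in range(n + 1)))   # sanity check: ~1.0
```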
A random variable, $Y$, has a value representing the outcome of an experiment (a coin lands heads up)
Probability distribution for a random variable is the probability $Pr(Y=y_i)$ that Y will be $y_i$ for each possible $y_i$
Expected value (aka mean) : $\mu_Y = E[Y] = \displaystyle \sum_i y_i Pr(Y=y_i)$
Variance : $Var(Y) = E[(Y-\mu_Y)^2]$ - How far the distribution spreads about the mean
Standard deviation : $\sigma_Y = \sqrt{ Var(Y) }$
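A minimal sketch of these three definitions for a discrete random variable (the outcome values and probabilities below are made up):

```python
# Expected value, variance, and standard deviation of a discrete random variable Y.
import numpy as np

values = np.array([0, 1, 2, 3])             # possible outcomes y_i
probs  = np.array([0.1, 0.4, 0.3, 0.2])     # Pr(Y = y_i); must sum to 1

mu    = np.sum(values * probs)              # E[Y]
var   = np.sum((values - mu) ** 2 * probs)  # E[(Y - mu)^2]
sigma = np.sqrt(var)
print(mu, var, sigma)
```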
Bell curve described by a mean and standard deviation
90% of area (probability) lies in $\mu \pm 1.64 \sigma$
N% of area (probability) lies in $\mu \pm z_N \sigma$
N% | 50% | 68% | 80% | 90% | 95% | 98% | 99% |
$z_N$ | 0.67 | 1.00 | 1.28 | 1.64 | 1.96 | 2.33 | 2.58 |
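The table can be reproduced (up to rounding) from the inverse normal CDF; this sketch assumes scipy is available:

```python
# z_N such that N% of the probability mass of a normal distribution
# lies within mu +/- z_N * sigma (two-sided).
from scipy.stats import norm

for N in [0.50, 0.68, 0.80, 0.90, 0.95, 0.98, 0.99]:
    print(f"{int(N * 100)}%: z_N = {norm.ppf((1 + N) / 2):.2f}")
```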
$x \in \mu \pm z_N \sigma$, translates to: "with probability N%"
$\mu \in x \pm z_N \sigma$, translates to: "with confidence N%"
Consider $x$, drawn from $\mathcal{N}(\mu,\sigma)$
We say: with confidence 95%, $\mu \in x \pm 1.96 \sigma$
If sample data $S$ contains $n \geq 30$ samples drawn independently of $h$ and of each other,
Then with approx. 95% probability $error_S(h)$ lies in interval
$error_D(h) \pm 1.96 \sqrt{\frac{error_D(h) (1 - error_D(h))}{n}}$
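A Monte Carlo check of this claim (the values of $p$ and $n$ below are assumptions made for the example):

```python
# Draw many independent test sets of size n, compute error_S(h) for each,
# and check how often it falls inside the stated 95% interval around error_D(h).
import numpy as np

rng = np.random.default_rng(1)
p, n, trials = 0.2, 50, 100_000             # assumed true error and sample size
half_width = 1.96 * np.sqrt(p * (1 - p) / n)

errors_S = rng.binomial(n, p, trials) / n   # sample error over many test sets
print(np.mean(np.abs(errors_S - p) <= half_width))  # close to 0.95
```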
Consider a set of independent, identically distributed random variables $Y_1 ... Y_n$, drawn from an arbitrary probability distribution with mean $\mu$ and finite variance $\sigma^2$. The sample mean is
$\bar{Y} \equiv \frac{1}{n} \displaystyle \sum^{n}_{i=1} Y_i$
As $n \rightarrow \infty$, the distribution governing $\bar{Y}$ approaches a Normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$
Yellow indicates the shape of the underlying distribution (Weibull); blue bars are 50 averages of 1 sample each
Yellow indicates the shape of the underlying distribution; blue bars are 50 averages of 5 samples each
Yellow indicates the shape of the underlying distribution; blue bars are 50 averages of 25 samples each
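The experiment behind these figures can be re-run with a short simulation; the Weibull shape parameter below is an assumption, since the slides do not state it:

```python
# 50 sample means, each the average of n draws from a Weibull distribution.
# As n grows, the means concentrate around the true mean with spread ~ 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(2)
for n in [1, 5, 25]:
    means = rng.weibull(1.5, size=(50, n)).mean(axis=1)  # shape parameter 1.5 assumed
    print(n, means.mean().round(3), means.std().round(3))
```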
By applying the central limit theorem we can approximate a binomial distribution with a normal distribution
We can then say:
If we use a classifier on a test set with $n$ samples
We estimate the error rate as $\hat{p} = \frac{|incorrect|}{n}$
If $n \geq 30$ (using the C.L.T.), $\hat{p}$ is approximately distributed as $\mathcal{N}(\mu, \sigma)$
$\mu$ is $error_D(h)$, and $\sigma$ is $\sqrt{ \frac{error_D(h) (1-error_D(h)) }{n} }$
Oi, we're stuck with $error_D(h)$'s in our expression!
Since $error_D(h)$ is unknown, for large $n$ we substitute the observed $\hat{p}$ for it, giving the usable interval $\hat{p} \pm z_N \sqrt{ \frac{ \hat{p} (1 - \hat{p}) }{n} }$
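A sketch of the resulting practical interval (the counts in the example are made up):

```python
# N% confidence interval for error_D(h), using the observed error rate p_hat
# in place of the unknown error_D(h) inside the variance term (n >= 30).
import numpy as np

def error_confidence_interval(p_hat, n, z_N=1.96):
    half_width = z_N * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Example: 12 mistakes on a test set of 40 samples, 95% confidence
print(error_confidence_interval(12 / 40, 40))
```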
Consider 2 hypotheses $h_1$ and $h_2$, each tested on an independently generated sample set from the same distribution
It would be interesting to know the difference in their true errors:
$d \equiv error_D(h_1) - error_D(h_2)$
Target: $d = error_D(h_1) - error_D(h_2)$
Estimate with: $\hat{d} \equiv error_{S_1}(h_1) - error_{S_2}(h_2)$
Use the C.L.T. to estimate the distribution of $\hat{d}$
$\sigma_{\hat{d}} = \sqrt{ \frac{error_{S_1}(h_1) (1 - error_{S_1}(h_1))}{ n_1} + \frac{ error_{S_2}(h_2) ( 1 - error_{S_2}(h_2) ) }{n_2} }$
Find the lower and upper limit of the interval such that N% of probability mass is within the interval:
$\hat{d} \pm z_N \sqrt{ \frac{error_{S_1}(h_1) (1 - error_{S_1}(h_1))}{ n_1} + \frac{ error_{S_2}(h_2) ( 1 - error_{S_2}(h_2) ) }{n_2} }$
Look up $z_N$ in the table of $N\%$ to $z_N$ values
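A sketch of this two-sample interval in Python (the error rates and sample sizes in the example are made up; scipy is assumed for the $z_N$ lookup):

```python
# N% confidence interval for d = error_D(h1) - error_D(h2),
# with h1 and h2 tested on independent sample sets S1 and S2.
import numpy as np
from scipy.stats import norm

def diff_confidence_interval(e1, n1, e2, n2, N=0.95):
    d_hat = e1 - e2
    sigma_d = np.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    z_N = norm.ppf((1 + N) / 2)
    return d_hat - z_N * sigma_d, d_hat + z_N * sigma_d

# Example: h1 errs 30% on 100 samples, h2 errs 20% on 120 samples
print(diff_confidence_interval(0.30, 100, 0.20, 120))
```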
Now consider hypotheses $h_A$ and $h_B$
Partition the data into $k$ disjoint test sets $T_1 ... T_k$ of equal size $\geq 30$
For $i$ from $1$ to $k$: $\delta_i \leftarrow error_{T_i}(h_A) - error_{T_i}(h_B)$
$\bar{ \delta } \leftarrow \frac{1}{k} \displaystyle \sum_{i=1}^k \delta_i$
N% confidence interval estimate:
$\bar{ \delta } \pm t_{N,k-1} s_{\bar{\delta}}$
$s_{\bar{\delta}} \equiv \sqrt{ \frac{1}{k(k-1)} \displaystyle \sum_{i=1}^k (\delta_i - \bar{ \delta })^2 }$
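A sketch of the whole k-fold paired procedure (the per-fold error rates in the example are made up; scipy is assumed for the $t$ lookup):

```python
# Paired comparison of h_A and h_B over k folds: per-fold differences delta_i,
# their mean delta_bar, the standard error s_delta_bar, and the N% t interval.
import numpy as np
from scipy.stats import t

def paired_t_interval(errors_A, errors_B, N=0.95):
    deltas = np.asarray(errors_A) - np.asarray(errors_B)         # delta_i per fold
    k = len(deltas)
    delta_bar = deltas.mean()
    s = np.sqrt(np.sum((deltas - delta_bar) ** 2) / (k * (k - 1)))
    t_N = t.ppf((1 + N) / 2, df=k - 1)
    return delta_bar - t_N * s, delta_bar + t_N * s

# Example with k = 5 folds of assumed error rates
print(paired_t_interval([0.22, 0.25, 0.20, 0.27, 0.24],
                        [0.18, 0.21, 0.19, 0.23, 0.20]))
```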
Where did $t_{N,k-1}$ come from?
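$t_{N,k-1}$ is the two-sided critical value of Student's $t$ distribution with $k-1$ degrees of freedom; it replaces $z_N$ because $s_{\bar{\delta}}$ is itself estimated from only $k$ values. A short check (assuming scipy) shows it approaches $z_N$ as $k$ grows:

```python
# Two-sided 95% critical values: Student's t with k-1 degrees of freedom vs. the normal z.
from scipy.stats import norm, t

for k in [3, 5, 10, 30, 100]:
    print(k, round(t.ppf(0.975, df=k - 1), 3))
print("z:", round(norm.ppf(0.975), 3))  # 1.96
```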
Now consider hypotheses $h_A$ and $h_B$
Target: $d = error_D(h_A) - error_D(h_B)$
N% confidence interval estimate:
$\bar{ \delta } \pm t_{N,k-1} s_{\bar{\delta}}$
$s_{\bar{\delta}} \equiv \sqrt{ \frac{1}{k(k-1)} \displaystyle \sum_{i=1}^k (\delta_i - \bar{ \delta })^2 }$
What did we just do?
Turn in hardcopy on 02/16
Features! (reading listed on site)