Introduction to Machine Learning and Data Mining

Kyle I S Harrington / kyle@eecs.tufts.edu







Some slides adapted from Tom Mitchell and Ryan Adams

Weka

A powerful ML package with a long history

Implements many of the algorithms we will study

Using Weka early on helps get an idea of what ML can do

Course Logistics

We'll be sticking to the Mitchell textbook for the most part

My office hours: by appointment, generally Thursday is best

k-Nearest Neighbors

Given a dataset $D$ with $N$ observations and $p$ dimensions

Every point $i$ is expressed as:

$x^i = ( x^i_1, x^i_2, ..., x^i_p )$

(Euclidean) distance between 2 points $i$ and $j$:

$d(x^i,x^j) = ( \displaystyle \sum^p_{m=1} ( x^i_m - x^j_m )^2 )^{1/2}$
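As a concrete illustration, here is a minimal kNN classifier built directly on this Euclidean distance; the toy data, labels, and function name are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify x_query by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' class labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data: points near the origin are "blue", points near (1,1) are "red"
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.3],
                    [0.8, 0.9], [0.9, 0.8], [0.7, 0.9]])
y_train = np.array(["blue", "blue", "blue", "red", "red", "red"])
print(knn_predict(X_train, y_train, np.array([0.75, 0.85]), k=5))  # -> "red"
```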

k-Nearest Neighbors

For some unseen observation $\hat{x}$

$k=5$

Decision Boundaries kNN

Decision boundaries are defined over the Voronoi diagram of the training points

Image by Cynthia Rudin

k-Nearest Neighbors

For some unseen observation $\hat{x}$

$k=9$

Do we want the majority in this case?

Distance-weighted kNN

idx | X | Y | $d(\hat{x},x_i)$ | Blue?
1 | 0.549097186162 | 0.362677386832 | 0.0884316415625 | False
2 | 0.655390965303 | 0.548725980881 | 0.158017656472 | False
3 | 0.485782640555 | 0.249828988661 | 0.205057654261 | False
4 | 0.451911896627 | 0.658980037822 | 0.224393254589 | True
5 | 0.305219682514 | 0.507370621777 | 0.234428626804 | True
6 | 0.429995633684 | 0.667751973208 | 0.241064874734 | True
7 | 0.619841102463 | 0.204366698701 | 0.260279623299 | False
8 | 0.254019391038 | 0.480744029655 | 0.28012416478 | True
9 | 0.262776335089 | 0.55080811249 | 0.288019748099 | True

Sum of distances to the blue neighbors: 1.26803066901
Sum of distances to the red neighbors: 0.711786575594
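A hedged sketch of how a distance-weighted vote could be computed, using the common $1/d^2$ weighting (the weighting scheme and names here are illustrative, not prescribed by the slide). With a plain majority vote the 5 blue neighbors would win; weighting by inverse distance favors the red neighbors, which are closer overall, so the decision can flip.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=9):
    """Distance-weighted kNN: each neighbor votes with weight 1 / d^2,
    so closer neighbors count more than distant ones."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    totals = {}
    for i in nearest:
        # Guard against a zero distance (query identical to a training point)
        w = 1.0 / max(dists[i] ** 2, 1e-12)
        totals[y_train[i]] = totals.get(y_train[i], 0.0) + w
    # The class with the largest total weight wins
    return max(totals, key=totals.get)
```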

kNN in Practice

Housing prices

Attributes of houses:

  • location
  • bedrooms, bathrooms
  • square footage

What if we used the hexadecimal color of the house?

Regression with kNN

For some house that hasn't gone on the market $\hat{x}$

Index | X | Y | $d(\hat{x},x_i)$ | House value
1 | 0.549097186162 | 0.362677386832 | 0.0884316415625 | $67,807.65
2 | 0.655390965303 | 0.548725980881 | 0.158017656472 | $165,240.34
3 | 0.485782640555 | 0.249828988661 | 0.205057654261 | $83,034.13
4 | 0.451911896627 | 0.658980037822 | 0.224393254589 | $275,335.16
5 | 0.305219682514 | 0.507370621777 | 0.234428626804 | $334,449.68

Estimated value of house $\hat{x}$ (mean of the 5 neighbor values): $185,173.39
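The estimate above is just the average of the five neighbors' values; a minimal sketch of kNN regression (with an optional distance-weighted variant) might look like this. The function name is illustrative.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5, weighted=False):
    """Predict a numeric target as the mean (optionally distance-weighted)
    of the k nearest neighbors' target values."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    if weighted:
        w = 1.0 / np.maximum(dists[nearest] ** 2, 1e-12)
        return float(np.sum(w * y_train[nearest]) / np.sum(w))
    return float(np.mean(y_train[nearest]))

# The slide's estimate is the plain mean of the five neighbor values:
values = np.array([67807.65, 165240.34, 83034.13, 275335.16, 334449.68])
print(values.mean())  # 185173.392 -> $185,173.39
```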

An Ode to ID3

An ID3

Decision tree

Is built greedily

So either stop soon*

Or eventually prune

Otherwise you're probably overfitting

*where soon is statistically significant

Decision Trees

Handling Continuous Values

Make it discrete!

$(Temperature > \frac{( 48 + 60 )}{2} )$

Consider each candidate boundary (i.e., the midpoint $\frac{a+b}{2}$ between adjacent sorted values)

Use information gain to choose the split as usual
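A minimal sketch of choosing such a boundary: sort the values, try each midpoint between adjacent values as a binary split, and keep the one with the highest information gain. The temperature/play data below is made up to echo the slide's example.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(values, labels):
    """Try each midpoint between adjacent sorted values as a binary split
    and return the threshold with the highest information gain."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    base = entropy(y)
    best_gain, best_t = -1.0, None
    for a, b in zip(v[:-1], v[1:]):
        if a == b:
            continue
        t = (a + b) / 2.0  # candidate boundary (a + b) / 2
        left, right = y[v <= t], y[v > t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = base - remainder
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Made-up example: Temperature vs. Play
temps = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, play))  # best boundary is (48 + 60) / 2 = 54
```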

Handling Missing Values

Some observations may not have values for all attributes

That's OK, we'll use it anyway

Multiple options (a sketch of the first follows):

  • When we reach the node $N$ that tests the missing attribute $A$, assign the most common value of $A$ among the training examples at $N$
  • Assign the most common value of $A$ at $N$ among examples of class $C$
  • Use probabilities based on the distribution of $A$ at $N$
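As a sketch of the first option above (the helper name and toy column are illustrative):

```python
from collections import Counter

def fill_most_common(column):
    """Replace missing entries (None) with the most common observed value
    among the examples that reach this node."""
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]

humidity = ["High", "High", None, "Normal", "High"]
print(fill_most_common(humidity))  # ['High', 'High', 'High', 'Normal', 'High']
```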

Overfitting

Reduced-error pruning

  • Build a tree as usual, potentially overfitting
  • Use a validation dataset
  • Greedily remove nodes whose removal improves accuracy on the validation data (a sketch follows)
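A rough sketch of the idea, under simplifying assumptions: the tree is a nested dict ({'attr': ..., 'branches': {value: subtree}}) with class labels at the leaves, every attribute value seen in validation has a branch, and each subtree is compared against the majority class of the validation examples that reach it. This is an illustration, not Mitchell's exact procedure.

```python
from collections import Counter

def predict(node, example):
    """A node is either a class label (leaf) or {'attr': name, 'branches': {...}}."""
    while isinstance(node, dict):
        node = node['branches'][example[node['attr']]]
    return node

def reduced_error_prune(node, val):
    """Bottom-up: replace a subtree with the majority class of the validation
    examples reaching it whenever the leaf is at least as accurate there."""
    if not isinstance(node, dict) or not val:
        return node
    for value in list(node['branches']):
        subset = [(x, y) for x, y in val if x[node['attr']] == value]
        node['branches'][value] = reduced_error_prune(node['branches'][value], subset)
    majority = Counter(y for _, y in val).most_common(1)[0][0]
    subtree_correct = sum(predict(node, x) == y for x, y in val)
    leaf_correct = sum(y == majority for _, y in val)
    return majority if leaf_correct >= subtree_correct else node
```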

Rule Post-Pruning

$(Outlook=Sunny \wedge Humidity=High) \implies No$

Rule Post-Pruning

Grow tree, allowing it to overfit

Convert a tree to a collection of rules

Remove each precondition whose removal improves estimated accuracy (on the validation set)

Sort rules by estimated accuracy, and maintain sorted order for classification
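A small sketch of the precondition-dropping step. It represents a rule as a list of (attribute, value) tests plus a class, and scores a rule with a Laplace-corrected accuracy on held-out data as a simple stand-in for C4.5's pessimistic estimate; the names and the tiny validation set are illustrative.

```python
def matches(preconds, example):
    """preconds is a list of (attribute, value) tests; an empty list matches everything."""
    return all(example.get(a) == v for a, v in preconds)

def rule_score(preconds, label, val):
    """Laplace-corrected accuracy of the rule on the examples it covers."""
    covered = [y for x, y in val if matches(preconds, x)]
    correct = sum(y == label for y in covered)
    return (correct + 1) / (len(covered) + 2)

def prune_rule(preconds, label, val):
    """Greedily drop any single precondition while doing so improves the score."""
    preconds = list(preconds)
    improved = True
    while improved and preconds:
        improved = False
        best = rule_score(preconds, label, val)
        for i in range(len(preconds)):
            candidate = preconds[:i] + preconds[i + 1:]
            if rule_score(candidate, label, val) > best:
                preconds, improved = candidate, True
                break
    return preconds

# Echoing the slides: both rainy validation days are "No", so the
# Wind=Strong precondition gets dropped from (Outlook=Rain ∧ Wind=Strong) => No
val = [({'Outlook': 'Rain', 'Wind': 'Weak'}, 'No'),
       ({'Outlook': 'Rain', 'Wind': 'Strong'}, 'No')]
print(prune_rule([('Outlook', 'Rain'), ('Wind', 'Strong')], 'No', val))
# -> [('Outlook', 'Rain')]
```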

Rules from Tree

$(Outlook=Sunny \wedge Humidity=High) \implies No$

$(Outlook=Sunny \wedge Humidity=Low) \implies Play$

$(Outlook=Overcast) \implies Play$

$(Outlook=Rain \wedge Wind=Weak) \implies Play$

$(Outlook=Rain \wedge Wind=Strong) \implies No$

Pruning Preconditions

$(Outlook=Rain \wedge Wind=Weak) \implies Play$

$(Outlook=Rain \wedge Wind=Strong) \implies No$

Outlook | Temp | Humidity | Windy | Play
Rainy | Low | High | Weak | No
Rainy | High | High | Strong | No

Pruning Preconditions

$(Outlook=Rain \wedge Wind=Weak) \implies Play$

$(Outlook=Rain \wedge Wind \neq Strong) \implies No$

Outlook | Temp | Humidity | Windy | Play
Rainy | Low | High | Weak | No
Rainy | High | High | Strong | No

Pruning Preconditions

$(Outlook=Rain \wedge Wind=Weak) \implies Play$

$(Outlook=Rain) \implies No$

Outlook | Temp | Humidity | Windy | Play
Rainy | Low | High | Weak | No
Rainy | High | High | Strong | No

Sorting Rules by Accuracy

$(Outlook=Rain) \implies No$

$(Outlook=Rain \wedge Wind=Weak) \implies Play$

Outlook | Temp | Humidity | Windy | Play
Rainy | Low | High | Weak | No
Rainy | High | High | Strong | No

Sorted rules

$(Outlook=Sunny \wedge Humidity=High) \implies No$

$(Outlook=Sunny \wedge Humidity=Low) \implies Play$

$(Outlook=Overcast) \implies Play$

$(Outlook=Rain) \implies No$

$(Outlook=Rain \wedge Wind=Weak) \implies Play$

Outlook | Temp | Humidity | Windy | Play
Rainy | Low | High | Weak | No
Rainy | High | High | Strong | No

Classification with rules

$(Outlook=Sunny \wedge Humidity=High) \implies No$

$(Outlook=Sunny \wedge Humidity=Low) \implies Play$

$(Outlook=Overcast) \implies Play$

$(Outlook=Rain) \implies No$

$(Outlook=Rain \wedge Wind=Weak) \implies Play$

Outlook | Temp | Humidity | Windy | Play
Sunny | Low | Low | Weak | ?
Rainy | High | High | Weak | ?
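A sketch of how classification with an ordered rule list could work: the first rule whose preconditions all hold determines the class. The rule encoding and the default class are assumptions for the example.

```python
def classify(rules, example, default='No'):
    """Apply rules in sorted order; the first matching rule wins."""
    for preconds, label in rules:
        if all(example.get(a) == v for a, v in preconds):
            return label
    return default

rules = [([('Outlook', 'Sunny'), ('Humidity', 'High')], 'No'),
         ([('Outlook', 'Sunny'), ('Humidity', 'Low')], 'Play'),
         ([('Outlook', 'Overcast')], 'Play'),
         ([('Outlook', 'Rain')], 'No'),
         ([('Outlook', 'Rain'), ('Wind', 'Weak')], 'Play')]

print(classify(rules, {'Outlook': 'Sunny', 'Humidity': 'Low', 'Wind': 'Weak'}))  # Play
print(classify(rules, {'Outlook': 'Rain', 'Humidity': 'High', 'Wind': 'Weak'}))  # No (the Rain rule fires first)
```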

Advantages of Rule Pruning

Why might one like rule pruning over reduced-error pruning?

Advantages of Rule Pruning

  • More specific than removing entire subtrees
  • Can remove distinctions near the root

Used in C4.5 (J48 in Weka)

Growing a tree with chi-squared

Before making a split, test if the split is statistically significant

Proposing a split

Given training dataset $D$

Propose a split on attribute $A$

Notation:

  • $N$ is the total number of instances in $D$
  • $N_c$ is the number of instances with class $c$
  • $D_x$ is the data with value $x$ for attribute $A$
  • $N_x$ is the number of instances in $D_x$
  • $N_{xc}$ is the number of instances in $D_x$ with class $c$

Null Hypothesis

Null hypothesis: $A$ is irrelevant

The total proportion of class $c$ in $D$ is $N_c/N$

If the null hypothesis is true, then on average:

$\hat{N}_{xc} = \frac{N_c}{N} |D_x| = \frac{N_c}{N} N_x$

Deviation from Null

Even if the null hypothesis is true,

the observed counts will rarely be exactly equal to their expected values

Measure deviation as:

$Dev=\displaystyle \sum_x \displaystyle \sum_c \frac{ ( N_{xc} - \hat{N}_{xc} )^2 } {\hat{N}_{xc}}$

Deviation from Null

Measure deviation as:

$Dev=\displaystyle \sum_x \displaystyle \sum_c \frac{ ( N_{xc} - \hat{N}_{xc} )^2 } {\hat{N}_{xc}}$

How far are the observed counts from the expected counts (based on the class distribution in the whole dataset)?
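A minimal sketch of computing $Dev$ from a table of counts $N_{xc}$ (rows are attribute values, columns are classes); the example counts are made up, and it assumes every expected count is nonzero.

```python
import numpy as np

def chi_squared_deviation(counts):
    """counts[x][c] = N_xc, the number of training instances with
    attribute value x and class c.  Returns Dev."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    N_x = counts.sum(axis=1, keepdims=True)   # instances per attribute value
    N_c = counts.sum(axis=0, keepdims=True)   # instances per class
    expected = N_x * N_c / N                  # \hat{N}_xc = (N_c / N) * N_x
    return float(((counts - expected) ** 2 / expected).sum())

# Example: an attribute with 2 values over 2 classes
counts = [[8, 2],
          [3, 7]]
print(chi_squared_deviation(counts))  # ~5.05
```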

Using the Deviation

Deviation is the chi-squared statistic

The larger the deviation, the further we are from the null hypothesis that an attribute is irrelevant

If $Dev$ is small, then we don't want the branch

Using the Deviation

Look up $Dev$ in a chi-squared table

How many degrees-of-freedom?

$DF = ( ( \mbox{# of attribute-values} ) - 1 ) ( \mbox{# of classes} - 1 )$

[Figure: chi-squared distribution for various degrees of freedom (Wikipedia, by Geek3, CC BY 3.0)]

Using the Deviation

Chi-square table will give a probability

A large $Dev$ leads to a small probability $\implies$ the observed pattern would be rare under the null hypothesis

Split if $probability < \alpha$

What should $\alpha$ be? 0.05 is a default
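Instead of a printed table, the lookup can be done with SciPy's chi-squared survival function; a hedged sketch, continuing the counts example above ($Dev \approx 5.05$, two attribute values, two classes):

```python
from scipy.stats import chi2

def should_split(dev, n_values, n_classes, alpha=0.05):
    """Split only if Dev is improbable under the null hypothesis."""
    df = (n_values - 1) * (n_classes - 1)
    p = chi2.sf(dev, df)  # P(X >= Dev) for a chi-squared with df degrees of freedom
    return p < alpha, p

print(should_split(5.05, n_values=2, n_classes=2))  # (True, p ~ 0.025)
```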

Reflecting on chi-squared

Doesn't require separate validation data

Statistical tests become less valid with less data

$\alpha$ is a parameter

Assignment 2 is not required, just bonus

Posted in the assignments section

Due: Feb 03

What do you get? +10% on the first quiz

Final Projects

Proposal due: March 7

Study a novel dataset with an advanced algorithm

Extend an ML algorithm

Do a comparative study of multiple algorithms

Final Projects

Due: April 25

Turn in a write-up (8-12 pages)

  • Background on problem
  • Related work
  • Your method
  • Results
  • Conclusion and future work
  • References

Should have at least 10 references

If multiple people work together, then more work is expected

What Next?

Naive Bayes (you may want to skim some of the probability tutorials)