Introduction to Machine Learning and Data Mining

Kyle I S Harrington / kyle@eecs.tufts.edu

Instance-based Learning

Easy! Just use the existing data for prediction.

How?

The problem

Notebook

k-Nearest Neighbors

Lazy classifier: only does work when classifying

Non-parametric, but has a hyperparameter: we must choose a value for k

In the limit of densely sampled data, it can be quite accurate (asymptotically, 1-NN error is at most twice the Bayes error)

kNN: Algorithm

Training: none (just store the labeled observations)

Classification:

For each new observation to classify

  • Find k nearest neighbors
  • Label observation with majority class of neighbors

Note: distance is generally Euclidean
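
A minimal sketch of the classification step in Python, assuming the training data is a list of (feature vector, label) pairs; the names euclidean and knnClassify are illustrative, not part of any starter code:

from collections import Counter
import math

def euclidean(p, q):
  # straight-line distance between two numeric feature vectors
  return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knnClassify(trainingData, query, k):
  # exhaustive search: measure the distance from the query to every training point
  neighbors = sorted(trainingData, key=lambda pair: euclidean(pair[0], query))[:k]
  # label the query with the majority class among the k nearest neighbors
  labels = [label for _, label in neighbors]
  return Counter(labels).most_common(1)[0][0]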

kNN: Limitations

Speed of classification: exhaustive search costs O(n) per query

Choosing k: too small overfits to noise, too large smooths over class boundaries

Noisy data: mislabeled or outlier neighbors can swing the vote

Nonuniform samples: densely sampled regions dominate the neighbor set

kNN: Exercise

Notebook

Assignment 2:

Submit a pull request on GitHub for kNN with exhaustive search

KD-trees

kNN isn't lazy enough: O(n) work per query is for chumps!

Let's organize our data into a tree to get roughly O(log n) lookups (in low dimensions)

KD-tree: Algorithm

Constructing a KD-tree


# numDimensions: number of coordinates per point, taken from the dataset
def buildTree(points, depth=0):
  if not points:
    return None
  axis = depth % numDimensions
  # sort the points along the current splitting axis and split at the median
  points = sorted(points, key=lambda p: p[axis])
  median = len(points) // 2
  tree = {}
  tree['location'] = points[median]
  tree['left'] = buildTree(points[:median], depth + 1)
  tree['right'] = buildTree(points[median + 1:], depth + 1)
  return tree
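
Construction alone doesn't give fast lookups; the search descends toward the query and backtracks only when the far side of a split could still hold a closer point. A minimal sketch, assuming the tree dicts built above and Euclidean distance; the names distance and nearestNeighbor are illustrative:

import math

def distance(p, q):
  return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearestNeighbor(tree, query, depth=0, best=None):
  if tree is None:
    return best
  point = tree['location']
  # update the best candidate if this node is closer to the query
  if best is None or distance(query, point) < distance(query, best):
    best = point
  axis = depth % numDimensions
  # descend first into the side of the split that contains the query
  if query[axis] < point[axis]:
    near, far = tree['left'], tree['right']
  else:
    near, far = tree['right'], tree['left']
  best = nearestNeighbor(near, query, depth + 1, best)
  # only search the far side if the splitting plane is closer than the current best
  if abs(query[axis] - point[axis]) < distance(query, best):
    best = nearestNeighbor(far, query, depth + 1, best)
  return best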
	  

Exercise: KD-trees

Notebook

Interlude

Quick demo of real-time KD-trees (brevis.example.swarm.neighborhood-line-swarm)

Generalization

What if our data isn't well-sampled?

What if we have to classify an alien (presumably we haven't seen those before)?

Assignment 1

Posted in the assignments section

Quick Weka recap

Due: Jan 27

Assignment 2

Posted in the assignments section

Due: Feb 03

What Comes Next?

Decision Trees