Introduction to Machine Learning and Data Mining




Neurocomputing

Kyle I S Harrington / kyle@eecs.tufts.edu







Some slides adapted from Geoffrey Hinton, UToronto; Tommi Jaakkola, MIT.

Starting with Neuroscience

Drawing of Purkinje cells (A) and granule cells (B) from pigeon cerebellum by Santiago Ramón y Cajal, 1899; Instituto Cajal, Madrid, Spain. Public domain.

Linear Threshold Units

Image from Margaret Krause, UNL.

From Growth to Learning

Hebb's learning rule: fire together, wire together










Hebbian Learning

$\Delta w_{kj} = \eta x_j y_k$

change weight $w_{kj}$ proportionally to the product of the input $x_j$ and the output $y_k$
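
A minimal sketch of this update in NumPy, assuming a single linear output unit; the variable names, learning rate, and random inputs are illustrative choices, not from the lecture.

import numpy as np

rng = np.random.default_rng(0)

eta = 0.1                             # learning rate
w = rng.normal(scale=0.01, size=3)    # weights w_kj feeding one output unit k

def hebbian_update(w, x, eta):
    # Hebbian rule: delta w_j = eta * x_j * y_k, with y_k = w . x
    y = w @ x
    return w + eta * x * y

for _ in range(10):                   # present a few random input patterns
    x = rng.normal(size=3)
    w = hebbian_update(w, x, eta)

print(w)                              # note: nothing in the rule bounds the weights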










Hebbian Learning

$\Delta w_{kj} = \eta x_j y_k$

Problems?










Supervised Method

How do we find weights that can produce a particular output?










Hinton's Fish and Chips

  • Diet of multiple portions of fish, chips, and ketchup
  • Cashier only gives total price of meal









Hinton's Fish and Chips

  • Start with random guesses for the price of each portion
  • After many days, the guesses should converge to the individual portion prices









Delta-rule

$\Delta w_{kj} = \eta ( t_k - o_k ) i_j$
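
A sketch of the delta rule on the fish-and-chips example; the per-portion prices, learning rate, and number of days are made-up values for illustration.

import numpy as np

rng = np.random.default_rng(1)

true_prices = np.array([1.50, 1.00, 0.25])   # fish, chips, ketchup (made-up prices)
w = rng.uniform(0.0, 3.0, size=3)            # random initial price guesses
eta = 0.01                                   # learning rate

for day in range(1000):
    portions = rng.integers(1, 5, size=3)    # i_j: portions ordered today
    total = portions @ true_prices           # t_k: total the cashier reports
    estimate = portions @ w                  # o_k: our estimated total
    w += eta * (total - estimate) * portions # delta-rule update

print(np.round(w, 2))                        # approaches the true per-portion prices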










Gradient Descent

start with some $w^{(0)}$
do:
     $w^{(t+1)} = w^{(t)} - \eta_t \nabla F(w^{(t)})$
until $||w^{(t+1)}-w^{(t)}|| \leq \epsilon$
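
A sketch of this loop on a concrete objective; the quadratic $F$, step size, and tolerance below are my choices for illustration.

import numpy as np

def F(w):
    # example objective (chosen only for illustration): F(w) = ||w - 3||^2
    return np.sum((w - 3.0) ** 2)

def grad_F(w):
    return 2.0 * (w - 3.0)

w = np.zeros(2)          # w^(0)
eta = 0.1                # step size eta_t (held constant here)
eps = 1e-6

while True:
    w_next = w - eta * grad_F(w)
    if np.linalg.norm(w_next - w) <= eps:
        break
    w = w_next

print(w)                 # close to the minimizer [3, 3]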

Learning Rate

$w^{(t+1)} = w^{(t)} - \eta_t \nabla F(w^{(t)})$

Images from Genevieve Orr

Stochastic Gradient Descent

Given a stream of input/output samples

Update weights after each sample (online learning)
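
A sketch of per-sample updates for a linear model with squared error; the synthetic data stream, true weights, and learning rate are made up for illustration.

import numpy as np

rng = np.random.default_rng(2)
true_w = np.array([2.0, -1.0])
eta = 0.05
w = np.zeros(2)

# stream of (x, y) samples; update the weights after each one
for t in range(2000):
    x = rng.normal(size=2)
    y = true_w @ x + 0.1 * rng.normal()   # noisy target
    error = y - w @ x
    w += eta * error * x                  # per-sample gradient step

print(np.round(w, 2))                     # close to true_w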

Stochastic Gradient Descent

Notebook!

Binary Classification

Predict a binary class, $y$, given a feature value, $x$

Predict $y = 1$ when $P(y=1|x) > P(y=0|x)$

Log-odds Ratio

Rewrite $P(y=1|x) > P(y=0|x)$

As log-odds: $\log \frac{P(y=1|\textbf{x})}{P(y=0|\textbf{x})} > 0$

Log-odds Ratio

$P(y|\textbf{x})$ is generally unknown and must be estimated from samples

Approximate the log-odds $\log \frac{P(y=1|\textbf{x})}{P(y=0|\textbf{x})}$ with the linear function

$f(x; w) = w_0 + x \cdot w_1$

Log-odds to Sigmoid

$\log \frac{P(y=1|\textbf{x})}{P(y=0|\textbf{x})} = w_0 + x \cdot w_1$

Inverting the log-odds gives the probability directly:

$P(y=1|x, w_0, w_1) = g( w_0 + x \cdot w_1 )$

where $g(z) = \frac{1}{1+e^{-z}}$

Logistic Function

"Squishing function"

$f(x) = \frac{1}{1+e^{-x}}$
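
A sketch connecting the two decision rules: the sigmoid $g$ maps the linear log-odds to $P(y=1|x)$, and thresholding that probability at 0.5 agrees with checking whether the log-odds is positive. The weights below are arbitrary example values.

import numpy as np

def g(z):
    # logistic ("squishing") function
    return 1.0 / (1.0 + np.exp(-z))

w0, w1 = -1.0, 2.0               # arbitrary example weights

def p_y1(x):
    return g(w0 + x * w1)        # P(y=1 | x, w0, w1)

for x in [-2.0, 0.0, 0.5, 2.0]:
    p = p_y1(x)
    log_odds = w0 + x * w1
    # the two rules agree: p > 0.5  <=>  log-odds > 0
    print(x, round(p, 3), p > 0.5, log_odds > 0)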

Regularization

Learn the weights, $w$, with a penalty:

$\text{argmax}_{w} \displaystyle \sum^m_{i=1} \log P(y_i|x_i, w ) - \alpha R(w) $

Regularization Term

Regularization term, $R(w)$, forces parameters to be small when $\alpha>0$

L1: $R(w) = ||w||_1 = \displaystyle \sum^n_{i=1} |w_i|$

L2: $R(w) = ||w||_2^2 = \displaystyle \sum^n_{i=1} w_i^2$

L2 shrinks all weights smoothly toward zero; L1 tends to drive some weights exactly to zero (sparse solutions)
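
A sketch of the two penalty terms as functions of a weight vector; the example weights and $\alpha$ are illustrative values.

import numpy as np

def l1_penalty(w):
    # R(w) = ||w||_1 = sum_i |w_i|
    return np.sum(np.abs(w))

def l2_penalty(w):
    # R(w) = ||w||_2^2 = sum_i w_i^2
    return np.sum(w ** 2)

w = np.array([0.5, -2.0, 0.0, 1.5])
alpha = 0.1
print(alpha * l1_penalty(w))   # penalty subtracted from the objective under L1
print(alpha * l2_penalty(w))   # penalty subtracted from the objective under L2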

Final Project Proposal

Due March 7 (Monday)

An example proposal

What Next?

Clustering