# Introduction to Machine Learning and Data Mining




Kyle I S Harrington / kyle@eecs.tufts.edu





Some slides adapted from Roni Khardon, Andrew Moore, Mohand Said Allili, David Sontag, Frank Dellaert, Michael Jordan, Yair Weiss

## Logistics

- Make-up class: May 5?
- Quiz 2 moved to 04/12
- Midterm questions
- [Early handwriting recognition](http://jackschaedler.github.io/handwriting-recognition/)
## Limitations of K-means

K-means has limitations for some data.

![K-means fails to separate 2 clusters](images/kmeans_k=2_incorrect.svg)

What do we do?
## Probabilistic Membership

We could assign cluster labels probabilistically using distance:

$p(x \in \textbf{cluster i}) = \frac{ 1 / d(x,C_i) }{ \displaystyle \sum^{k}_{j=1} 1 / d(x,C_j) }$

where $d(x,C_i)$ is the distance from instance $x$ to the centroid of cluster $i$, so closer centroids receive higher membership probability.
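A minimal sketch of this inverse-distance membership rule (my own illustration; `X` and `centroids` are assumed NumPy arrays, not names from the slides):

```python
import numpy as np

def soft_membership(X, centroids, eps=1e-12):
    """Assign each point a probability of belonging to each cluster,
    with closer centroids receiving higher probability."""
    # Euclidean distance from every point to every centroid: shape (n_points, k)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    inv = 1.0 / (dists + eps)                      # closer => larger weight
    return inv / inv.sum(axis=1, keepdims=True)    # rows sum to 1

# Example: three points, two centroids in 2D
X = np.array([[0.0, 0.0], [2.0, 2.0], [0.9, 1.1]])
centroids = np.array([[0.0, 0.0], [2.0, 2.0]])
print(soft_membership(X, centroids))
```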
## Probabilistic Membership

![Ambiguity of K-means clear from distance map](images/kmeans_k=2_incorrect_dists.svg)

Problem solved?
## Distribution Approximation

- We have a set of samples that were drawn from some unknown distribution.
- What if we want to know the probability that a sample came from a distribution?
- We need to know or approximate the underlying distribution.
## Gaussian

We know Gaussians from 1D:

$f(x \mid \mu,\sigma) = \frac{1}{\sigma \sqrt{2 \pi}} e^{ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2}$

![Example 1D Gaussian. mean = 0, std = 1](images/example_gaussian_1D.svg)
## Gaussian Mixture Models

Most distributions don't look like a single Gaussian. We can use multiple Gaussians to describe a distribution. A Gaussian mixture model uses k Gaussians, each with its own mixing weight, mean, and covariance matrix.
## Gaussian Mixture Models

- k Gaussians
- mixing weight for each Gaussian
- mean for each Gaussian
- covariance matrix for each Gaussian
## Gaussian Mixture Models

Each component of a GMM is a multivariate Gaussian:

$N(x \mid \mu, \Sigma) = \frac{e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}}{(2 \pi)^{d/2} \sqrt{ |\Sigma| }}$

and the mixture density is the weighted sum of components:

$p(x) = \displaystyle \sum_{k} w_k \, N(x \mid \mu_k, \Sigma_k)$
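A small sketch of evaluating these densities in NumPy (an illustration under the definitions above; the function names and toy parameters are mine, not the slides'):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) at a single point x."""
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    expo = -0.5 * diff @ np.linalg.inv(Sigma) @ diff
    return np.exp(expo) / norm_const

def gmm_pdf(x, weights, means, covs):
    """Mixture density: sum_k w_k * N(x | mu_k, Sigma_k)."""
    return sum(w * gaussian_pdf(x, mu, S)
               for w, mu, S in zip(weights, means, covs))

# Example: 2-component mixture in 2D
weights = [0.6, 0.4]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]
print(gmm_pdf(np.array([1.0, 1.0]), weights, means, covs))
```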

## Gaussian Mixture Models

- The covariance matrix describes the shape of the Gaussian in each direction
## Gaussian Mixture Models

Spherical covariance: each Gaussian has a single variance, the same in every direction (circular contours)

![GMM spherical covariance matrix](images/em_k=2_cov_spherical.svg)
## Gaussian Mixture Models

Diagonal covariance: each Gaussian has one variance per dimension (axis-aligned ellipses)

![GMM diagonal covariance matrix](images/em_k=2_cov_diag.svg)
## Gaussian Mixture Models

Full covariance: each Gaussian has a full covariance matrix, so the variance in any direction can depend on all dimensions (arbitrarily oriented ellipses)

![GMM full covariance matrix](images/em_k=2_cov_full.svg)
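If scikit-learn is available, its `GaussianMixture` exposes these covariance structures through `covariance_type` (a sketch on toy data, not the data behind the figures above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two blobs as a stand-in for the figures
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
               rng.normal([4, 4], [1.5, 0.5], size=(200, 2))])

for cov_type in ["spherical", "diag", "full"]:
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type,
                          random_state=0).fit(X)
    # The stored covariance shape reflects how constrained each choice is
    print(cov_type, "covariances shape:", np.shape(gmm.covariances_))
```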
## Gaussian Mixture Models

GMMs are generative! Pick Gaussian $k$ with probability $w_k$, then generate a point from the corresponding Gaussian.
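A hedged sketch of that generative recipe (the parameter values here are illustrative only):

```python
import numpy as np

def sample_gmm(n, weights, means, covs, rng=None):
    """Generate n points: choose a component by its weight, then sample from it."""
    if rng is None:
        rng = np.random.default_rng()
    ks = rng.choice(len(weights), size=n, p=weights)   # component labels
    points = np.array([rng.multivariate_normal(means[k], covs[k]) for k in ks])
    return points, ks

weights = [0.6, 0.4]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]
points, labels = sample_gmm(500, weights, means, covs)
```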

## Gaussian Mixture Models

Given some data $D$ drawn from an unknown distribution, estimate the parameters $\theta$ of a GMM that fits the data.

How do we find the Gaussian parameters?

## Flipping coins

2 coins, A and B, with biases to land on heads: $\theta_A$ and $\theta_B$

Repeat for $i$ = 1 to 5 (a simulation sketch follows this list):

- Randomly choose a coin, store its identity as class label $z_i$, e.g. ['b', 'b', 'a', 'a', 'b']
- Flip the coin 10 times, store the number of heads as $x_i$, e.g. [4, 8, 4, 4, 7]
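A sketch of this generating process (the true biases below are made up; the slides only specify the procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical true biases; the slides do not state them.
theta = {"a": 0.4, "b": 0.6}

z, x = [], []
for i in range(5):
    coin = str(rng.choice(["a", "b"]))              # randomly choose a coin (z_i)
    heads = int(rng.binomial(n=10, p=theta[coin]))  # flip it 10 times, count heads (x_i)
    z.append(coin)
    x.append(heads)

print(z)  # e.g. ['b', 'b', 'a', 'a', 'b']
print(x)  # e.g. [4, 8, 4, 4, 7]
```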

## Flipping coins

We can calculate the biases from these observations directly:

$\hat{\theta}_A = \frac{ \text{\# heads using A}}{ \text{\# flips using A}} = \frac{8}{20}$

$\hat{\theta}_B = \frac{ \text{\# heads using B}}{ \text{\# flips using B}} = \frac{19}{30}$

This information was recorded as the number of heads in $x$ and the coin identity in $z$
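The same calculation in code, using the recorded $z$ and $x$:

```python
z = ['b', 'b', 'a', 'a', 'b']
x = [4, 8, 4, 4, 7]
flips_per_trial = 10

for coin in ['a', 'b']:
    heads = sum(h for c, h in zip(z, x) if c == coin)   # heads using this coin
    flips = flips_per_trial * z.count(coin)              # flips using this coin
    print(coin, heads, '/', flips, '=', heads / flips)
# a: 8 / 20 = 0.4
# b: 19 / 30 ≈ 0.633
```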

What if we didn't have $z$?

## Flipping coins

What if we didn't have $z$ (the identity of the coin that produced each count of heads)?

$z$ is now a "hidden", or "latent", variable

How do we find $\theta_A$ and $\theta_B$?

## Flipping coins

Start with initial estimates $\hat{\theta}^{(0)} = ( \hat{\theta}_A^{(0)}, \hat{\theta}_B^{(0)} )$

For each set of 10 coin flips, where we know the number of heads $x_i$, i.e. [4, 8, 4, 4, 7]:

- Determine which coin was most likely using $\hat{\theta}^{(t)}$
- Assume these were the coins that generated the data, and use maximum likelihood to update $\hat{\theta}^{(t+1)}$

Repeat until convergence.
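A sketch of this hard-assignment procedure in Python (the initial estimates and variable names are my own choices, not from the slides):

```python
import numpy as np
from scipy.stats import binom

x = np.array([4, 8, 4, 4, 7])   # heads out of 10 flips in each trial
n = 10
theta_a, theta_b = 0.3, 0.7     # arbitrary initial estimates

for step in range(20):
    # Decide which coin most likely produced each trial, given current estimates
    use_a = binom.pmf(x, n, theta_a) >= binom.pmf(x, n, theta_b)
    # Maximum likelihood update, pretending those hard assignments are correct
    if use_a.any():
        theta_a = x[use_a].sum() / (n * use_a.sum())
    if (~use_a).any():
        theta_b = x[~use_a].sum() / (n * (~use_a).sum())

print(theta_a, theta_b)
```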

## Maximum Likelihood Estimation

We know $x_1$, $x_2$, ..., $x_R$ which follow some $N( \mu, \sigma^2 )$

How do we find $\mu$ (assume we know $\sigma^2$)?

## Maximum Likelihood Estimation

We know $x_1$, $x_2$, ..., $x_R$ which follow some $N( \mu, \sigma^2 )$

How do we find $\mu$ (assume we know $\sigma^2$)?

Maximum Likelihood Estimation: For which $\mu$ is $x_1, x_2, ... x_R$ most likely?

Maximum a posteriori: Which $\mu$ maximizes $p(\mu | x_1, x_2, ... x_R, \sigma^2 )$?
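For the MLE question above, the answer is a standard closed-form result (stated here for completeness, not from the slides): the log likelihood $\displaystyle \sum_{i=1}^{R} \log N(x_i \mid \mu, \sigma^2)$ is maximized at the sample mean,

$\hat{\mu}_{MLE} = \frac{1}{R} \displaystyle \sum_{i=1}^{R} x_i$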

## Expectation-maximization

The Expectation-Maximization (EM) algorithm is an iterative approach to maximizing the likelihood when some variables are hidden

## Expectation-maximization

Iterate:

- E-Step: Estimate the distribution over the hidden variables, given the data and the current parameters
- M-Step: Choose new parameters that maximize the expected log likelihood under that distribution
## Expectation-maximization

The quantity we want to maximize is the likelihood $L(\Theta \mid X) = p(X \mid \Theta) = \displaystyle \sum_Z p(X, Z \mid \Theta)$, which is hard to optimize directly because $Z$ is hidden.

Expectation step: using the current parameters $\Theta^{(t)}$, calculate the expected log likelihood of the complete data,

$E[\log L(\Theta \mid X, Z)] = \displaystyle \sum_Z p(Z \mid X, \Theta^{(t)}) \, \log p(X, Z \mid \Theta)$
## Expectation-maximization

Maximization step: find the parameters that maximize the expected log likelihood

$\Theta^{(t+1)} = \underset{\Theta}{\operatorname{argmax}} \; E[\log L(\Theta \mid X, Z)]$
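As a concrete, simplified illustration of these two steps, here is EM for a one-dimensional, two-component GMM; the responsibilities computed in the E-step stand in for the distribution over $Z$ (names and toy data are mine, not the slides'):

```python
import numpy as np
from scipy.stats import norm

def em_step(x, w, mu, sigma):
    """One EM iteration for a 1-D Gaussian mixture.
    w, mu, sigma are length-k arrays of mixing weights, means, std devs."""
    # E-step: responsibility r[i, k] = p(z_i = k | x_i, current parameters)
    dens = w * norm.pdf(x[:, None], mu, sigma)       # shape (n, k)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibility-weighted data
    nk = r.sum(axis=0)                               # effective counts per component
    w_new = nk / len(x)
    mu_new = (r * x[:, None]).sum(axis=0) / nk
    var_new = (r * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk
    return w_new, mu_new, np.sqrt(var_new)

# A few iterations on toy data drawn from two well-separated Gaussians
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
w, mu, sigma = np.array([0.5, 0.5]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
for _ in range(10):
    w, mu, sigma = em_step(x, w, mu, sigma)
print(w, mu, sigma)
```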

## Expectation-maximization

Initialization:

- Mean of the data plus random offsets
- Use K-means to get a good initialization (see the sketch after this list)
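A sketch of the K-means initialization idea, assuming scikit-learn's `KMeans` is available (the helper name is mine): each K-means cluster supplies an initial weight, mean, and covariance.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_init(X, k, random_state=0):
    """Initialize GMM parameters from a k-means clustering of X."""
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=random_state).fit_predict(X)
    weights = np.bincount(labels, minlength=k) / len(X)
    means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    covs = np.array([np.cov(X[labels == j].T) for j in range(k)])
    return weights, means, covs
```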

## Expectation-maximization

Termination:

- Maximum number of iterations
- Threshold on the change in log-likelihood (see the example after this list)
- Threshold on the change in parameters
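With scikit-learn's `GaussianMixture`, the first two criteria correspond to `max_iter` and `tol` (a small illustration on toy data, not from the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), rng.normal(4, 1, size=(200, 2))])

# Stop after 200 iterations, or when the per-sample gain in the
# log-likelihood lower bound drops below tol.
gmm = GaussianMixture(n_components=2, max_iter=200, tol=1e-3,
                      random_state=0).fit(X)
print(gmm.converged_, gmm.n_iter_, gmm.lower_bound_)
```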

## Expectation-maximization

Limitations:

- Prone to local maxima
- Numerical stability of the covariance matrix (add noise to stabilize; see the sketch after this list)
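Local maxima are commonly mitigated by running EM from several initializations and keeping the best result. For the stability issue, one common variant of "adding noise" is to add a small constant to the covariance diagonal; a minimal sketch (the helper name and epsilon are my own choices):

```python
import numpy as np

def stabilize(cov, eps=1e-6):
    """Regularize a covariance estimate by adding a small value to its diagonal,
    keeping it positive definite and invertible during EM."""
    return cov + eps * np.eye(cov.shape[0])
```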
## Choosing number of components

We can use the Bayesian Information Criterion (BIC) to score GMMs (lower is better):

$BIC = -2 \ln( L(\Theta | X ) ) + k \cdot \ln( n )$

where $L$ is the likelihood, $k$ is the number of free parameters (degrees of freedom), and $n$ is the number of datapoints.
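A sketch of BIC-based model selection using scikit-learn, which computes this score via `GaussianMixture.bic` (toy data; the best number of components is the one with the lowest BIC):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(150, 2)), rng.normal(5, 1, size=(150, 2))])

# Fit GMMs with different numbers of components and keep the lowest BIC
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(bics, best_k)
```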
## EM Example Step 1

![Step 1 of EM](images/em_k=2_cov_full_step=1.svg)

## EM Example Step 2

![Step 2 of EM](images/em_k=2_cov_full_step=2.svg)

## EM Example Step 3

![Step 3 of EM](images/em_k=2_cov_full_step=3.svg)

## EM Example Step 4

![Step 4 of EM](images/em_k=2_cov_full_step=4.svg)

## What Next?

Guest Lecture on Aggregation