Gaussian Mixture Models

This tutorial covers the basics of the Expectation-Maximization (EM) algorithm, used here to estimate Gaussian Mixture Models. First, we lay down the problem of measuring from unseen groups. Then, we derive the expectation step, which lets us perform maximum likelihood estimation on our data. Finally, the maximization step allows us to update the parameters and obtain a full classification model for the data.

Reference slides

Download the slides

The corresponding slides cover

  • Expectation Maximization
  • Mixture models

Tutorial

Expectation Maximization (EM) is a powerful technique for creating maximum likelihood estimators when the variables are difficult to separate. Here, we set up a Gaussian mixture experiment with two Gaussians and derive the corresponding estimators of their means using EM.

8.1 - Unseen groups

Suppose we have a population with two distinct groups of individuals with different heights. If we randomly pick an individual from the population, we don’t know which group the individual belongs to. The goal is to estimate the mean heights of the two distinct groups when we have an unlabeled distribution sampled from both groups.

Both groups are normally distributed around two different means:

$$X \sim \mathcal{N}(\mu_0, \sigma^2) \quad \text{or} \quad X \sim \mathcal{N}(\mu_1, \sigma^2).$$

Note that we fix the standard deviation $\sigma$ to be the same for both groups, but the means ($\mu_0 \neq \mu_1$) are different. The problem is to estimate the means given that you can't directly know which group you are picking from.

Then we can write the joint density for this experiment as

$$f_{\mu_0,\mu_1}(x, z) = \frac{1}{2\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu_z)^2}{2\sigma^2}\right),$$

where $z = 0$ if we pick from group $0$ and $z = 1$ for group $1$. Note that the $1/2$ comes from the 50/50 chance of picking either group. Unfortunately, since we do not measure the (latent) variable $z$, we have to integrate it out of our density function to account for this handicap. Thus,

$$f_{\mu_0,\mu_1}(x) = \frac{1}{2}\,\mathcal{N}(x; \mu_0, \sigma^2) + \frac{1}{2}\,\mathcal{N}(x; \mu_1, \sigma^2).$$

Now, since trials are independent, we can write out the likelihood:

$$\mathcal{L}(\mu_0, \mu_1) = \prod_{i=1}^{n} f_{\mu_0,\mu_1}(x_i).$$

We recall that the independent-trials assumption means that the joint probability is just the product of the individual probabilities. Usually, maximum likelihood proceeds by maximizing this as a function of $\mu_0$ and $\mu_1$, based on all the observed $x_i$. However, here we do not know which group we are measuring at each trial, so we cannot simply estimate the parameters of each group separately.

Exercise

  1. Simulate the experiment by sampling the two populations (a possible sketch is given below)
  2. Try different means and observe how the overlap between the two groups makes them harder to separate.
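
As a minimal sketch of this simulation (assuming NumPy and Matplotlib, with arbitrary illustration values for the means and the shared standard deviation):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Arbitrary illustration values: two groups with means mu0 and mu1,
# a shared standard deviation sigma, mixed with 50/50 probability.
mu0, mu1, sigma, n = 160.0, 175.0, 5.0, 1000

z = rng.integers(0, 2, size=n)             # hidden group labels (not observed in practice)
x = np.where(z == 0,
             rng.normal(mu0, sigma, n),
             rng.normal(mu1, sigma, n))    # observed, unlabeled measurements

plt.hist(x, bins=40, density=True)
plt.xlabel("height")
plt.ylabel("density")
plt.show()
```

Bringing the means closer together makes the two modes of the histogram merge, which is exactly the regime where the groups are hard to separate.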

8.2 - Expectation maximization

The key idea of expectation maximization is that we will pretend to know the hidden value and use a maximum likelihood estimate (this is the maximization part), by computing an expectation (a guess) over the missing variable (in this case, $z$).

As usual, it is easier and more numerically stable to use the log of the likelihood function. It is useful to keep track of the incomplete log-likelihood (the log of the marginal likelihood $\mathcal{L}(\mu_0, \mu_1)$ above), since it can be proved to be monotonically increasing over the EM iterations and therefore provides a good way to identify coding errors.
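
As a small helper for this bookkeeping (the function name is illustrative; the 50/50 mixture weights come from the setup above, and SciPy is assumed to be available):

```python
import numpy as np
from scipy.stats import norm

def incomplete_log_likelihood(x, mu0, mu1, sigma):
    """Log of the marginal (incomplete) likelihood of the 50/50 two-Gaussian mixture."""
    mixture = 0.5 * norm.pdf(x, mu0, sigma) + 0.5 * norm.pdf(x, mu1, sigma)
    return np.sum(np.log(mixture))
```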

Expectation step

We denote $\theta = (\mu_0, \mu_1)$ the set of parameters and $x = (x_1, \dots, x_n)$ the datapoints, with hidden labels $z = (z_1, \dots, z_n)$. The joint density function of $x$ and $z$ can be written as the following:

$$f_\theta(x, z) = \prod_{i=1}^{n} \frac{1}{2\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu_{z_i})^2}{2\sigma^2}\right).$$

For the expectation part we have to compute

$$Q(\theta \mid \theta^{(k)}) = \mathbb{E}_{z \mid x, \theta^{(k)}}\left[\log f_\theta(x, z)\right].$$

Exercise

  1. Compute the expression of the expectation
  2. Implement a function that computes it given a set of parameters

Solution

Since $z_i \in \{0, 1\}$, the expression of $Q(\theta \mid \theta^{(k)})$ simplifies easily:

$$Q(\theta \mid \theta^{(k)}) = \sum_{i=1}^{n} \left[ \log \frac{1}{2\sqrt{2\pi\sigma^2}} - \frac{\left(1 - \mathbb{E}[z_i \mid x_i, \theta^{(k)}]\right)(x_i - \mu_0)^2 + \mathbb{E}[z_i \mid x_i, \theta^{(k)}]\,(x_i - \mu_1)^2}{2\sigma^2} \right].$$

Now, the only thing left is to find $\mathbb{E}[z_i \mid x_i, \theta^{(k)}] = P(z_i = 1 \mid x_i, \theta^{(k)})$, which we can do using Bayes' rule:

$$P(z_i = 1 \mid x_i, \theta^{(k)}) = \frac{P(x_i \mid z_i = 1, \theta^{(k)})\, P(z_i = 1)}{P(x_i \mid \theta^{(k)})}.$$

The term $P(x_i \mid \theta^{(k)})$ in the denominator comes from summing (integrating) out the $z_i$ in the full joint density:

$$P(x_i \mid \theta^{(k)}) = \frac{1}{2}\,\mathcal{N}\!\left(x_i; \mu_0^{(k)}, \sigma^2\right) + \frac{1}{2}\,\mathcal{N}\!\left(x_i; \mu_1^{(k)}, \sigma^2\right),$$

and since $P(z_i = 1) = P(z_i = 0) = 1/2$, we finally obtain

$$w_i^{(k)} := \mathbb{E}[z_i \mid x_i, \theta^{(k)}] = \frac{\mathcal{N}\!\left(x_i; \mu_1^{(k)}, \sigma^2\right)}{\mathcal{N}\!\left(x_i; \mu_0^{(k)}, \sigma^2\right) + \mathcal{N}\!\left(x_i; \mu_1^{(k)}, \sigma^2\right)}.$$
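
A possible implementation of this expectation step (the function name and signature are illustrative, not imposed by the tutorial):

```python
import numpy as np
from scipy.stats import norm

def expectation_step(x, mu0, mu1, sigma):
    """Return w_i = E[z_i | x_i, theta] for the 50/50 two-Gaussian mixture."""
    p0 = norm.pdf(x, mu0, sigma)   # density of x_i under group 0
    p1 = norm.pdf(x, mu1, sigma)   # density of x_i under group 1
    return p1 / (p0 + p1)          # posterior probability that x_i came from group 1
```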

Maximization step

Now, given that we have this estimate $w_i^{(k)}$ for $\mathbb{E}[z_i \mid x_i, \theta^{(k)}]$, we can go back and compute the new estimate of the parameters from

$$Q(\theta \mid \theta^{(k)}) = \sum_{i=1}^{n} \left[ \log \frac{1}{2\sqrt{2\pi\sigma^2}} - \frac{\left(1 - w_i^{(k)}\right)(x_i - \mu_0)^2 + w_i^{(k)}(x_i - \mu_1)^2}{2\sigma^2} \right]$$

by maximizing it using basic calculus.

Exercise

  1. Derive the maximization of the log-likelihood
  2. Implement a function that performs the maximization on a set of data

Solution

The trick is to remember that $w_i^{(k)}$ is fixed, so we only have to maximize the parts that depend on $\mu_0$ and $\mu_1$. Setting the derivative of $Q(\theta \mid \theta^{(k)})$ with respect to $\mu_0$ to zero leads to

$$\mu_0^{(k+1)} = \frac{\sum_{i=1}^{n} \left(1 - w_i^{(k)}\right) x_i}{\sum_{i=1}^{n} \left(1 - w_i^{(k)}\right)},$$

and for $\mu_1$:

$$\mu_1^{(k+1)} = \frac{\sum_{i=1}^{n} w_i^{(k)} x_i}{\sum_{i=1}^{n} w_i^{(k)}}.$$

Now that we have both the maximization step and the expectation step, we can simulate the algorithm and compute the estimates for both $\mu_0$ and $\mu_1$.
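
The corresponding maximization step, as a sketch under the same naming assumptions as before:

```python
import numpy as np

def maximization_step(x, w):
    """Update the means given the membership weights w_i = E[z_i | x_i, theta]."""
    mu1 = np.sum(w * x) / np.sum(w)                 # weighted mean for group 1
    mu0 = np.sum((1.0 - w) * x) / np.sum(1.0 - w)   # weighted mean for group 0
    return mu0, mu1
```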

Expectation maximization is a powerful algorithm that is especially useful when it is difficult to decouple the variables involved in a standard maximum likelihood estimation. Note that convergence to the "correct" maximum is not guaranteed, as we observed here. This is even more pronounced when there are more parameters to estimate (a much more detailed mathematical derivation is available here).

Exercise

  1. Fill the main loop of the EM algorithm (a possible sketch is given after this list)
  2. Plot the evolution of the estimates of both $\mu_0$ and $\mu_1$
  3. Plot the corresponding incomplete likelihood function
  4. Analyze the speed of convergence for the EM algorithm
  5. Plot the error surface (in terms of parameters)
  6. Test your algorithm for different initial values.
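
A minimal sketch of such a loop, assuming the simulated data x from section 8.1 and the helper functions sketched above (expectation_step, maximization_step, incomplete_log_likelihood), with arbitrary initial values:

```python
# Initial guesses (arbitrary) and the fixed, known standard deviation.
mu0, mu1, sigma = 150.0, 190.0, 5.0
history = []

for iteration in range(100):
    w = expectation_step(x, mu0, mu1, sigma)    # E-step: membership weights
    mu0, mu1 = maximization_step(x, w)          # M-step: update the means
    history.append((mu0, mu1, incomplete_log_likelihood(x, mu0, mu1, sigma)))

print(mu0, mu1)   # should approach the true group means
```

The history list keeps the successive estimates and incomplete log-likelihood values, which is enough to produce the convergence and likelihood plots asked for above.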

Expected output

8.3 - 3-dimensional GMM

The previous toy example can easily be generalized to any given number of Gaussians and any dimensionality. We are given a data set $\mathcal{D} = \{x_1, \dots, x_n\}$ where each $x_i$ is a $d$-dimensional vector measurement. Assume that the points are generated in an IID fashion from an underlying density $p(x)$. We further assume that $p(x)$ is defined as a finite mixture model with $K$ components:

$$p(x \mid \Theta) = \sum_{k=1}^{K} \alpha_k\, p_k(x \mid z_k, \theta_k),$$

where the $p_k(x \mid z_k, \theta_k)$ are the mixture components, i.e. densities defined over $x$ with parameters $\theta_k$; the $z_k$ are mutually exclusive binary indicator variables representing the identity of the mixture component that generated $x$; and finally the $\alpha_k$ are the mixture weights, representing the probability that a randomly selected $x$ was generated by component $k$, with $\sum_{k=1}^{K} \alpha_k = 1$.

We can compute the "membership weight" of data point $x_i$ in cluster $k$, given parameters $\Theta$, as

$$w_{ik} = p(z_{ik} = 1 \mid x_i, \Theta) = \frac{p_k(x_i \mid z_k, \theta_k)\, \alpha_k}{\sum_{m=1}^{K} p_m(x_i \mid z_m, \theta_m)\, \alpha_m}, \qquad 1 \le k \le K,\ 1 \le i \le n.$$

The membership weights above reflect our uncertainty, given $x_i$ and $\Theta$, about which of the $K$ components generated the vector $x_i$. Note that we are assuming in our generative mixture model that each $x_i$ was generated by a single component.

Mixture model. For $x \in \mathbb{R}^d$ we can define a Gaussian mixture model by making each of the $K$ components a Gaussian density with parameters $\mu_k$ and $\Sigma_k$. Each component is a multivariate Gaussian density

$$p_k(x \mid \theta_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k)\right)$$

with its own parameters $\theta_k = \{\mu_k, \Sigma_k\}$.

We define the EM algorithm for Gaussian mixtures as follows. It is an iterative algorithm that starts from some initial estimate of $\Theta$ (e.g., random), and then proceeds to iteratively update $\Theta$ until convergence is detected. Each iteration consists of an E-step and an M-step.

E-Step
Denote the current parameter values as $\Theta$. Compute the membership weights $w_{ik}$ (using the equation above) for all data points $x_i$, $1 \le i \le n$, and all mixture components, $1 \le k \le K$.
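
A sketch of what this E-step could look like in code (the array shapes and the function name are assumptions for illustration, not the course's emGMMFit interface):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alphas, mus, Sigmas):
    """Membership weights w[i, k] = P(component k | x_i, Theta).

    X:      (n, d) data matrix
    alphas: (K,)   mixture weights
    mus:    (K, d) component means
    Sigmas: (K, d, d) component covariances
    """
    n, K = X.shape[0], len(alphas)
    w = np.zeros((n, K))
    for k in range(K):
        w[:, k] = alphas[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
    return w / w.sum(axis=1, keepdims=True)   # normalize across components
```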

M-Step Use the membership weights and the data to calculate new parameter values. Let $N_k = \sum_{i=1}^{n} w_{ik}$, i.e., the sum of the membership weights for the $k$-th component; this is the effective number of data points assigned to component $k$. Specifically, we compute the new mixture weights

$$\alpha_k^{\text{new}} = \frac{N_k}{n}, \qquad 1 \le k \le K.$$

The updated mean is calculated in a manner similar to the previous exercises:

$$\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik}\, x_i, \qquad 1 \le k \le K.$$

Finally, the update rule for the covariance matrix is identical to how we would normally compute an empirical covariance matrix, except that the contribution of each data point is weighted by $w_{ik}$:

$$\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik}\, \left(x_i - \mu_k^{\text{new}}\right)\left(x_i - \mu_k^{\text{new}}\right)^{\top}, \qquad 1 \le k \le K.$$
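
A matching sketch of these M-step updates, under the same assumed shapes as the E-step sketch:

```python
import numpy as np

def m_step(X, w):
    """Update mixture weights, means and covariances from membership weights w (n, K)."""
    n, d = X.shape
    N_k = w.sum(axis=0)                        # effective number of points per component
    alphas = N_k / n                           # new mixture weights
    mus = (w.T @ X) / N_k[:, None]             # new means (weighted averages)
    Sigmas = np.zeros((len(N_k), d, d))
    for k in range(len(N_k)):
        diff = X - mus[k]                      # data centered on component k's new mean
        Sigmas[k] = (w[:, k, None] * diff).T @ diff / N_k[k]
    return alphas, mus, Sigmas
```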

After we have computed all of the new parameters, the M-step is complete and we can now go back and recompute the membership weights in the E-step, then recompute the parameters again in the M-step, and continue updating the parameters in this manner. Each pair of E and M steps is considered to be one iteration.
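
Putting the two sketches above together, an illustrative iteration loop (not the course's emGMMFit itself; the initialization choices here are arbitrary) might look like:

```python
import numpy as np

def em_gmm(X, K, n_iters=100, seed=0):
    """Run EM on an (n, d) data matrix X with K Gaussian components."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alphas = np.full(K, 1.0 / K)                        # uniform initial mixture weights
    mus = X[rng.choice(n, K, replace=False)]            # random data points as initial means
    Sigmas = np.stack([np.cov(X.T) for _ in range(K)])  # global covariance as initial guess
    for _ in range(n_iters):
        w = e_step(X, alphas, mus, Sigmas)              # E-step
        alphas, mus, Sigmas = m_step(X, w)              # M-step
    return alphas, mus, Sigmas, w
```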

Exercise

  1. Fill the expectation step in the emGMMFit function
  2. Fill the maximization step in the same function
  3. Test your implementations by running the emGMM3D function
  4. Test your algorithm for different initial values.

Expected output