# Bayesian inference

This tutorial covers the basics of *Bayesian inference* and how to use **Bayes' theorem** to perform classification of normally distributed (Gaussian) data points. First, we will see how to derive *estimators* for the different properties of the distributions, and verify that these are *unbiased*. Then, we will implement **Maximum Likelihood** Estimators (MLE) in order to classify a dataset. We will start with the case where the parameters are known, in order to implement the discriminant function and decision rule. We will then use the MLE to obtain the mean and covariance matrix of each class.

# Reference slides

Download the slides

The corresponding slides cover

- Bayesian learning
- Undirected graphical models
- Maximum likelihood

# Tutorial

## 7.0 - Bayesian framework

Two alternative interpretations of probability can be considered:

- **Frequentist** (the more *classical* approach) assumes that probability is the long-run frequency of events. For example, the *probability of plane accidents* is interpreted as the *long-term frequency of plane accidents*. This becomes harder to interpret when events have no long-term frequency of occurrence. In that case (e.g. the probability of an election outcome, which happens only once), frequentists consider alternative realities, and the frequency of occurrence across all these realities defines the probability.
- **Bayesian** interprets probability as a measure of *believability in an event*. A probability is therefore a measure of *belief*, or confidence, in an event occurring. This definition leaves room for conflicting beliefs between individuals, based on the different *information* they have about the world. Hence, Bayesian inference is mostly based on updating your beliefs after considering new evidence, and differs from more traditional statistical inference by preserving *uncertainty*.

To align with probability notation, we denote our belief about an event $A$ as $P(A)$, called the *prior probability* of the event occurring. We denote the updated belief as $P(A \mid X)$, interpreted as the probability of $A$ *given* the new evidence $X$, called the *posterior probability*. The prior belief is not completely removed after seeing new evidence $X$; instead, we *re-weight the prior* to incorporate the new evidence (i.e. we put more weight, or confidence, on some beliefs versus others). By introducing prior uncertainty about events, we admit that any guess we make can be wrong. However, with more and more instances of evidence, our prior belief is *washed out* by the new evidence. Let $N$ denote the number of instances of evidence we have. As we gather an *infinite* amount of evidence ($N \to \infty$), the Bayesian results (often) align with the frequentist results. For small $N$, however, inference is *unstable*: frequentist estimates have more variance and larger confidence intervals. By introducing a prior and returning probabilities, we *preserve the uncertainty* that reflects this instability on a small dataset.

Updating the *prior belief* to obtain our *posterior belief* is done via Bayes’ theorem:

$$
P(A \mid X) = \frac{P(X \mid A)\,P(A)}{P(X)} \propto P(X \mid A)\,P(A)
$$

We see that our posterior belief in event $A$ given the new evidence $X$ is proportional to ($\propto$) the *likelihood* $P(X \mid A)$ of observing this particular evidence given the event, multiplied by our prior belief $P(A)$ in that particular event.
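As a minimal sketch of this update rule, the snippet below applies Bayes’ theorem to a discrete set of hypotheses. The function name `bayes_update` and the two-hypothesis coin example (a fair coin versus a coin with probability 0.8 of heads) are assumptions made purely for illustration.

```python
def bayes_update(prior, likelihood):
    """Return the posterior P(A | X) for each hypothesis A.

    prior:      dict mapping hypothesis -> P(A)
    likelihood: dict mapping hypothesis -> P(X | A) for the observed evidence X
    """
    # Unnormalized posterior: likelihood times prior
    unnormalized = {a: likelihood[a] * prior[a] for a in prior}
    # The evidence P(X) is the normalizing constant
    evidence = sum(unnormalized.values())
    return {a: u / evidence for a, u in unnormalized.items()}


# Hypothetical example: is the coin fair, or biased towards heads?
prior = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.8}  # P(heads | hypothesis)

posterior = prior
for _ in range(3):  # observe three heads in a row
    posterior = bayes_update(posterior, p_heads)

# Roughly 0.196 for "fair" and 0.804 for "biased"
print({a: round(v, 3) for a, v in posterior.items()})
```

Each call re-weights the current belief by the likelihood of the new observation, so the posterior of one step becomes the prior of the next.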

## 7.1 - Bayesian inference

Suppose we have a coin and want to estimate its probability of heads ($p$). The coin is Bernoulli distributed:

$$
\phi(x) = p^{x} (1 - p)^{1 - x}
$$

where $x$ is the outcome, *1* for heads and *0* for tails. Based on $n$ *independent* flips, we have the likelihood:

$$
\mathcal{L}(p \mid \mathbf{x}) = \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i}
$$

(the independent-trials assumption allows us to just substitute everything into $\phi(x)$).

The idea of *maximum likelihood* is to maximize this as a function of $p$ after plugging in all of the data $x_i$. This means that our estimator, $\hat{p}$, is a function of the observed data and, as such, is a random variable with its own distribution.

**Defining the estimator**

A basic check that our estimator is well defined is to verify that it is unbiased, namely, that

$$
\mathbb{E}(\hat{p}) = p
$$

**Exercise**

- Compute the *log-likelihood* of our given problem
- Based on this, compute its derivative
- Solve it to find the estimator $\hat{p}$
- Verify that this estimator is unbiased
- Compute the variance of the estimator

**Solution** [Reveal]

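Because this problem is simple, we can solve it in general. Since $x_i = 0$ or $x_i = 1$, each term in the product of $\mathcal{L}$ above is either $p$, if $x_i = 1$, or $1 - p$, if $x_i = 0$. This means that we can write

$$
\mathcal{L}(p \mid \mathbf{x}) = p^{\sum_{i=1}^{n} x_i} (1 - p)^{n - \sum_{i=1}^{n} x_i}
$$

with the corresponding *log-likelihood* defined as

$$
J = \log \mathcal{L}(p \mid \mathbf{x}) = \log(p) \sum_{i=1}^{n} x_i + \log(1 - p) \left( n - \sum_{i=1}^{n} x_i \right)
$$

Taking the derivative of this and setting it to zero gives:

$$
\frac{dJ}{dp} = \frac{1}{p} \sum_{i=1}^{n} x_i - \frac{1}{1 - p} \left( n - \sum_{i=1}^{n} x_i \right) = 0
$$

and solving this leads to

$$
\hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

This is our *estimator* for $p$. To check if this estimator is biased, we compute its expectation:

$$
\mathbb{E}(\hat{p}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}(x_i) = \frac{1}{n}\, n\, \mathbb{E}(x_i)
$$

by linearity of the expectation and where

$$
\mathbb{E}(x_i) = p
$$

Therefore,

$$
\mathbb{E}(\hat{p}) = p
$$

This means that the estimator is unbiased. This is good news. We almost always want our estimators to be unbiased. Similarly,

$$
\mathbb{E}(\hat{p}^2) = \frac{1}{n^2}\, \mathbb{E}\left[ \left( \sum_{i=1}^{n} x_i \right)^2 \right]
$$

and where

$$
\mathbb{E}(x_i^2) = p
$$

and by the independence assumption, for $i \neq j$,

$$
\mathbb{E}(x_i x_j) = \mathbb{E}(x_i)\, \mathbb{E}(x_j) = p^2
$$

Thus,

$$
\mathbb{E}(\hat{p}^2) = \frac{1}{n^2} \left( n p + n (n - 1) p^2 \right)
$$

So, the variance of the estimator, $\hat{p}$, is the following:

$$
\mathbb{V}(\hat{p}) = \mathbb{E}(\hat{p}^2) - \mathbb{E}(\hat{p})^2 = \frac{p (1 - p)}{n}
$$

Note that the $n$ in the denominator means that the variance asymptotically goes to zero as $n$ increases, leading to a better estimate of the underlying $p$. Unfortunately, this formula for the variance is practically useless because we have to know $p$ to compute it, and $p$ is the parameter we are trying to estimate in the first place! But, looking at $\mathbb{V}(\hat{p})$, we can immediately notice that if $p = 0$, then there is no estimator variance because the outcomes are guaranteed to be tails. Also, the maximum of this variance, for whatever $n$, happens at $p = 1/2$. This is our worst-case scenario and the only way to compensate is with more samples (i.e. larger $n$).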
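The unbiasedness and variance results can be checked empirically. The following sketch simulates many datasets of $n$ coin flips, computes the sample-mean estimator $\hat{p} = \frac{1}{n} \sum_i x_i$ on each, and compares the mean and variance of the estimates with $p$ and $p(1-p)/n$. The function names and the values `p_true = 0.3`, `n = 50` are assumptions chosen for illustration.

```python
import random
import statistics


def mle_bernoulli(flips):
    """MLE of p for Bernoulli data: the fraction of ones."""
    return sum(flips) / len(flips)


def simulate_estimator(p, n, trials, seed=0):
    """Draw `trials` datasets of n flips and return the list of estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        flips = [1 if rng.random() < p else 0 for _ in range(n)]
        estimates.append(mle_bernoulli(flips))
    return estimates


p_true, n = 0.3, 50  # hypothetical values
estimates = simulate_estimator(p_true, n, trials=20000)

print("mean of estimates:    ", statistics.fmean(estimates))
print("expected (p):         ", p_true)
print("variance of estimates:", statistics.pvariance(estimates))
print("expected (p(1-p)/n):  ", p_true * (1 - p_true) / n)
```

Increasing `n` shrinks the spread of the estimates, while moving `p_true` towards 0.5 widens it, in line with the variance formula.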

**Full estimator density**

In general, computing the mean and variance of the estimator is insufficient to characterize the underlying probability density of $\hat{p}$, unless we know that $\hat{p}$ is normally distributed. This is where the *central limit theorem* comes in. Indeed, the form of the estimator, which is a sample mean, implies that $\hat{p}$ is normally distributed, but only *asymptotically*, which does not tell us how many samples $n$ we need. Unfortunately, in the real world, each sample may be precious. Hence, to write out the full density for $\hat{p}$, we first have to ask what is the probability that the estimator equals a specific value, such as

$$
\hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i = 0
$$

This can only happen when $x_i = 0$, $\forall i$. The corresponding probability can be computed from the density

$$
f(\mathbf{x}, p) = \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i}
\quad \Rightarrow \quad
f\left( \sum_{i=1}^{n} x_i = 0,\; p \right) = (1 - p)^{n}
$$

Likewise, if $\{x_i\}$ has exactly one value equal to one, then

$$
f\left( \sum_{i=1}^{n} x_i = 1,\; p \right) = n\, p\, (1 - p)^{n - 1}
$$

where the $n$ comes from the $n$ ways to pick one value equal to one from the $n$ elements $x_i$. Continuing this way, we can construct the entire density as

$$
f\left( \sum_{i=1}^{n} x_i = k,\; p \right) = \binom{n}{k} p^{k} (1 - p)^{n - k}
$$

where the first factor on the right-hand side is the binomial coefficient of $n$ things taken $k$ at a time. This is the binomial distribution and it’s not the density for $\hat{p}$, but rather for $n \hat{p}$. We’ll leave this as-is because it’s easier to work with below. We just have to remember to keep track of the $n$ factor.
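To make the link with the binomial distribution concrete, the sketch below compares the binomial probabilities $\binom{n}{k} p^k (1-p)^{n-k}$ with the empirical frequencies of the number of heads $n\hat{p}$ over many simulated datasets. The helper names and the values `n = 10`, `p = 0.5` are illustrative assumptions.

```python
import math
import random
from collections import Counter


def binomial_pmf(k, n, p):
    """Probability that n Bernoulli(p) flips contain exactly k ones."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def empirical_pmf(n, p, trials, seed=0):
    """Empirical distribution of the number of ones (n * p_hat) over many datasets."""
    rng = random.Random(seed)
    counts = Counter(
        sum(rng.random() < p for _ in range(n)) for _ in range(trials)
    )
    return {k: counts[k] / trials for k in range(n + 1)}


n, p = 10, 0.5  # hypothetical values
empirical = empirical_pmf(n, p, trials=50000)

for k in range(n + 1):
    # p_hat = k / n: dividing by n recovers the value taken by the estimator
    print(f"p_hat = {k / n:.1f}  exact = {binomial_pmf(k, n, p):.4f}  "
          f"empirical = {empirical[k]:.4f}")
```

The printed values of `p_hat` show the factor $n$ mentioned above: the binomial distribution is over the counts $k$, and the estimator takes the values $k/n$.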

## 7.2 - Gaussian classification

Maximum Likelihood Estimation (MLE) allows us to perform typical statistical pattern classification tasks. In the cases where **probabilistic models and parameters are known**, the design of a Bayes’ classifier is rather easy. However, in real applications, we are rarely given this information and this is where the MLE comes into play.

MLE still **requires partial knowledge** about the problem. We have to assume that the **model of the class conditional densities is known** (usually Gaussian distributions). Hence, using MLE, we want to estimate the values of the parameters of a given distribution for the class-conditional densities, for example the *mean* and *variance*, assuming that the class-conditional densities are *normally* distributed (Gaussian) with

$$
p(\mathbf{x} \mid \omega_i) \sim \mathcal{N}(\mu, \sigma^2)
$$

## 7.3 - Parameters known

Imagine that we want to classify data consisting of two-dimensional patterns, $\mathbf{x} = [x_1, x_2]^T$, that could belong to 1 out of 3 classes $\omega_1, \omega_2, \omega_3$.

Let’s assume the following information about the model, where we use a continuous multivariate normal (Gaussian) model for the class-conditional densities

$$
p(\mathbf{x} \mid \omega_j) \sim \mathcal{N}(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)
$$

Furthermore, we consider for this first problem that we know the distributions of the classes, i.e. their means and covariances.

Therefore, the means of the sample distributions for 2-dimensional features are defined as

The **covariance matrices** for the statistically independent and identically distributed (‘i.i.d’) features

Finally, we consider that all classes have an **equal prior probability**

$$
P(\omega_1) = P(\omega_2) = P(\omega_3) = \frac{1}{3}
$$
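As a starting point for generating the data, here is a minimal sketch that draws samples from three bivariate Gaussians with NumPy. The means and covariance matrices in `MEANS` and `COVS` are placeholder values assumed for illustration only, and should be replaced by the parameters of the problem; `sample_classes` is likewise a hypothetical helper name.

```python
import numpy as np

# Placeholder parameters, assumed only for this illustration
MEANS = [np.array([0.0, 0.0]), np.array([9.0, 0.0]), np.array([6.0, 6.0])]
COVS = [3.0 * np.eye(2) for _ in MEANS]


def sample_classes(means, covs, n_per_class, seed=0):
    """Draw n_per_class points from each class-conditional Gaussian."""
    rng = np.random.default_rng(seed)
    X = np.vstack([
        rng.multivariate_normal(mean, cov, size=n_per_class)
        for mean, cov in zip(means, covs)
    ])
    # Integer labels 0, 1, 2 stand for the classes w1, w2, w3
    y = np.repeat(np.arange(len(means)), n_per_class)
    return X, y


X, y = sample_classes(MEANS, COVS, n_per_class=100)
print(X.shape, y.shape)  # (300, 2) (300,)
```

The arrays `X` and `y` can then be passed to a scatter plot, using `y` to color the points by class.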

**Exercise**

- Generate some data (samples from the multivariate Gaussians) following the class distributions
- Plot the class-dependent data

**Expected output** [Reveal]

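Here, our **objective** is to maximize the discriminant function $g_i(\mathbf{x})$, which we define as the posterior probability $P(\omega_i \mid \mathbf{x})$, in order to perform a **minimum-error classification** (Bayes classifier).

Our decision rule is therefore to choose the class $\omega_i$ for which $g_i(\mathbf{x})$ is maximal. Since the evidence $p(\mathbf{x})$ is the same for all classes and the logarithm is monotonic, we can equivalently use

$$
g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln(2\pi) - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)
$$

where $d = 2$ is the dimension of the patterns.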
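The following sketch shows one possible way to implement the decision rule, using the log form of the Gaussian discriminant, $g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$, with the class-independent constant dropped. All function names are hypothetical, and the means, covariances and sample sizes are placeholder values assumed for illustration.

```python
import numpy as np


def discriminant(x, mean, cov, prior):
    """Log-posterior discriminant g_i(x), up to a class-independent constant."""
    diff = x - mean
    cov_inv = np.linalg.inv(cov)
    # Mahalanobis term, for a single point (d,) or a batch of points (n, d)
    maha = np.einsum("...i,ij,...j->...", diff, cov_inv, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * maha - 0.5 * logdet + np.log(prior)


def classify(X, means, covs, priors):
    """Decision rule: pick the class with the largest discriminant value."""
    scores = np.stack(
        [discriminant(X, m, c, p) for m, c, p in zip(means, covs, priors)],
        axis=-1,
    )
    return np.argmax(scores, axis=-1)


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


# Placeholder parameters and data, assumed only for this illustration
rng = np.random.default_rng(0)
means = [np.array([0.0, 0.0]), np.array([9.0, 0.0]), np.array([6.0, 6.0])]
covs = [3.0 * np.eye(2)] * 3
priors = [1 / 3] * 3
X = np.vstack([rng.multivariate_normal(m, c, size=100) for m, c in zip(means, covs)])
y = np.repeat(np.arange(3), 100)

y_pred = classify(X, means, covs, priors)
cm = confusion_matrix(y, y_pred, n_classes=3)
error = np.mean(y_pred != y)
print(cm)
print("empirical error:", error)
```

The empirical error is the fraction of off-diagonal entries in the confusion matrix.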

**Exercise**

- Implement the discriminant function
- Implement the decision rule (classifier)
- Classify the data generated in the previous exercise
- Plot the confusion matrix
- Calculate the empirical error

## 7.4 - Unknown parameters

In contrast to the previous case, let us assume that we only know the number of parameters for the class conditional densities $p(\mathbf{x} \mid \omega_i)$, and we want to use Maximum Likelihood Estimation (MLE) to estimate the values of these parameters from the training data.

Given the information about our model (the data is normally distributed), the 2 parameters to be estimated for each class are $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$, which are summarized by the parameter vector

$$
\boldsymbol{\theta}_i = \begin{bmatrix} \theta_{i1} \\ \theta_{i2} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\mu}_i \\ \boldsymbol{\Sigma}_i \end{bmatrix}
$$

For the MLE, we assume that we have a set of samples $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ that are *i.i.d.* (independent and identically distributed), drawn with probability $p(\mathbf{x} \mid \omega_i, \boldsymbol{\theta}_i)$. Thus, we can **work with each class separately** and omit the class labels, so that we write the probability density as

$$
p(\mathbf{x} \mid \boldsymbol{\theta})
$$

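**Likelihood of $\boldsymbol{\theta}$**

Thus, the probability of observing $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ is

$$
p(D \mid \boldsymbol{\theta}) = p(\mathbf{x}_1 \mid \boldsymbol{\theta})\, p(\mathbf{x}_2 \mid \boldsymbol{\theta}) \cdots p(\mathbf{x}_n \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})
$$

where $p(D \mid \boldsymbol{\theta})$ is also called the *likelihood of $\boldsymbol{\theta}$*.

We know that $p(\mathbf{x} \mid \boldsymbol{\theta}) \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ (remember that we dropped the class labels, since we are working with every class separately), and the multivariate normal density is given as

$$
p(\mathbf{x} \mid \boldsymbol{\theta}) = \frac{1}{(2\pi)^{d/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]
$$

Therefore, we obtain

$$
p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} \frac{1}{(2\pi)^{d/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x}_k - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu}) \right]
$$

and the log of the multivariate density

$$
l(\boldsymbol{\theta}) = \ln p(D \mid \boldsymbol{\theta}) = \sum_{k=1}^{n} \left[ -\frac{1}{2} (\mathbf{x}_k - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu}) - \frac{d}{2} \ln(2\pi) - \frac{1}{2} \ln |\boldsymbol{\Sigma}| \right]
$$

In order to obtain the MLE $\hat{\boldsymbol{\theta}}$, we maximize $l(\boldsymbol{\theta})$, which can be done via differentiation

$$
\nabla_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta}) = \sum_{k=1}^{n} \nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x}_k \mid \boldsymbol{\theta}) = 0
$$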

**Exercise**

- Perform the differentiation for $\boldsymbol{\mu}$ to obtain the estimator of the mean
- Perform the differentiation for $\boldsymbol{\Sigma}$ to obtain the estimator of the covariance matrix
- Implement the two estimators as functions based on a set of data
- Apply these estimators (MLE) in order to obtain estimated parameters
- Re-compute the classification errors on the previous dataset

**Solution** [Reveal]

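**Estimator for the mean $\hat{\boldsymbol{\mu}}$**

After doing the differentiation, we find that the MLE of the parameter $\boldsymbol{\mu}$ is given by the equation

$$
\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k
$$

As you can see, this is simply the mean of our dataset, so we can implement the code very easily and compare the estimate to the actual values of $\boldsymbol{\mu}$.

**Estimator for the covariance matrix $\hat{\boldsymbol{\Sigma}}$**

Analogously to $\hat{\boldsymbol{\mu}}$, we can find the equation for $\hat{\boldsymbol{\Sigma}}$ via differentiation, which gives

$$
\hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}) (\mathbf{x}_k - \hat{\boldsymbol{\mu}})^T
$$

which we can also implement and then compare to the actual values of $\boldsymbol{\Sigma}$.

**Classification**

Using the estimated parameters $\hat{\boldsymbol{\mu}}_i$ and $\hat{\boldsymbol{\Sigma}}_i$, which we obtained via MLE, we can simply compute the error on the sample dataset again.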
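A minimal sketch of the two estimators could look as follows, assuming the samples of one class are stored as the rows of a NumPy array. The function names `mle_mean` and `mle_cov`, as well as the true parameters used to generate the test data, are assumptions made for illustration.

```python
import numpy as np


def mle_mean(X):
    """MLE of the mean: the average of the rows of X, shape (n, d)."""
    return X.mean(axis=0)


def mle_cov(X):
    """MLE of the covariance: normalizes by n (not n - 1)."""
    diff = X - mle_mean(X)
    return diff.T @ diff / X.shape[0]


# Placeholder class parameters, assumed only for this illustration
rng = np.random.default_rng(0)
true_mean = np.array([6.0, 6.0])
true_cov = 3.0 * np.eye(2)
X = rng.multivariate_normal(true_mean, true_cov, size=500)

mu_hat = mle_mean(X)
sigma_hat = mle_cov(X)
print("estimated mean:", mu_hat)
print("estimated covariance:\n", sigma_hat)
```

Note that the MLE of the covariance divides by $n$, which corresponds to `np.cov(X, rowvar=False, bias=True)`, whereas the default of `np.cov` divides by $n - 1$. Applying both functions to the samples of each class gives the estimated parameters to plug into the classifier.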

**Expected output** [Reveal]

## 7.5 - Audio source separation

The maximum likelihood estimator (MLE) is widely used in practical signal modeling, and we can show that the MLE is equivalent to the least squares estimator for a wide class of problems, including well resolved sinusoids in white noise. We are going to consider a model consisting of a complex sinusoid in additive white (complex) noise:

$$
x(n) = \mathcal{A} e^{j \omega_0 n} + v(n)
$$

Here, $\mathcal{A} = A e^{j\phi}$ is the complex amplitude of the sinusoid, and $v(n)$ is white noise that we assume to be Gaussian distributed with zero mean. Hence, we assume that its probability density function is given by

$$
p_v(\nu) = \frac{1}{\pi \sigma_v^2}\, e^{-\frac{|\nu|^2}{\sigma_v^2}}
$$

We express the zero-mean Gaussian assumption by writing

$$
v(n) \sim \mathcal{N}(0, \sigma_v^2)
$$

The parameter $\sigma_v^2$ is the *variance* of the random process $v(n)$, and $\sigma_v$ is its standard deviation. It turns out that when Gaussian random variables $v(n)$ are uncorrelated (i.e., when $v(n)$ is white noise), they are also independent. This means that the probability of observing particular values of $v(n)$ and $v(m)$ is given by the product of their respective probabilities. We will now use this fact to compute an explicit probability for observing any data sequence $x(n)$.

Since the sinusoidal part of our signal model, $\mathcal{A} e^{j \omega_0 n}$, is deterministic (i.e., it does not include any random components), it may be treated as the time-varying mean of a Gaussian random process $x(n)$. That is, our signal model can be rewritten as

$$
x(n) \sim \mathcal{N}\left( \mathcal{A} e^{j \omega_0 n},\, \sigma_v^2 \right)
$$

and the probability density function for the whole set of observations $x(n)$, $n = 0, \ldots, N-1$, is given by

$$
p(x) = p[x(0)]\, p[x(1)] \cdots p[x(N-1)] = \left( \frac{1}{\pi \sigma_v^2} \right)^{N} \exp\left( -\frac{1}{\sigma_v^2} \sum_{n=0}^{N-1} \left| x(n) - \mathcal{A} e^{j \omega_0 n} \right|^2 \right)
$$

Thus, given the noise variance $\sigma_v^2$ and the three sinusoidal parameters $A$, $\phi$, $\omega_0$ (remember that $\mathcal{A} = A e^{j\phi}$), we can compute the relative probability of any observed data samples $x$.
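As an illustration of how this density can be used, the sketch below evaluates the log-likelihood of an observed sequence under the model $x(n) \sim \mathcal{N}(\mathcal{A} e^{j\omega_0 n}, \sigma_v^2)$ and estimates the sinusoid by a simple grid search over frequency. The grid search and the function names are assumptions made for this example, not the method of the paper cited below; for each candidate frequency, the complex amplitude is obtained by least squares, which maximizes the likelihood for that frequency.

```python
import numpy as np


def log_likelihood(x, amp, omega, sigma2):
    """Log-density of x under x(n) ~ N(amp * exp(j*omega*n), sigma2), complex amp."""
    n = np.arange(len(x))
    residual = x - amp * np.exp(1j * omega * n)
    return -len(x) * np.log(np.pi * sigma2) - np.sum(np.abs(residual) ** 2) / sigma2


def fit_sinusoid(x, omegas):
    """Grid search over frequency with a least-squares complex amplitude."""
    n = np.arange(len(x))
    best = None
    for omega in omegas:
        basis = np.exp(1j * omega * n)
        # Projection of x onto the basis (np.vdot conjugates its first argument)
        amp = np.vdot(basis, x) / len(x)
        err = np.sum(np.abs(x - amp * basis) ** 2)
        if best is None or err < best[0]:
            best = (err, omega, amp)
    _, omega_hat, amp_hat = best
    return omega_hat, amp_hat


# Synthetic test signal with hypothetical parameters
rng = np.random.default_rng(0)
N, omega0, sigma2 = 256, 0.9, 0.5
amp_true = 1.5 * np.exp(1j * 0.7)
n = np.arange(N)
# Complex white noise of variance sigma2 (sigma2 / 2 per real and imaginary part)
noise = np.sqrt(sigma2 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
x = amp_true * np.exp(1j * omega0 * n) + noise

omega_hat, amp_hat = fit_sinusoid(x, np.linspace(0, np.pi, 2001))
print("estimated frequency:", omega_hat)
print("estimated amplitude and phase:", abs(amp_hat), np.angle(amp_hat))
print("log-likelihood at the estimate:", log_likelihood(x, amp_hat, omega_hat, sigma2))
```

Minimizing the squared error is the same as maximizing the likelihood here, since the log-likelihood depends on the sinusoidal parameters only through the sum of squared residuals.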

We can generalize this approach in order to perform a complete blind audio source separation algorithm, as detailed in the following paper:

Févotte, C., & Cardoso, J. F. “Maximum likelihood approach for blind audio source separation using time-frequency Gaussian source models”. *IEEE Workshop on Applications of Signal Processing to Audio and Acoustics*, 2005. (pp. 78-81). IEEE.

You can try to implement this by relying on the full paper.

**Exercise**

- Implement the single sinusoid extraction
- Apply this approach to multiple sinusoids
- Follow the paper to implement blind source separation