# Tutorial 2: Statistical Inference#

**Week 0, Day 5: Probability & Statistics**

**By Neuromatch Academy**

**Content creators:** Ulrik Beierholm

**Content reviewers:** Natalie Schaworonkow, Keith van Antwerp, Anoop Kulkarni, Pooya Pakarian, Hyosub Kim

**Production editors:** Ethan Cheng, Ella Batty

#Tutorial Objectives

This tutorial builds on Tutorial 1 by explaining how to do inference through inverting the generative process.

By completing the exercises in this tutorial, you should:

understand what the likelihood function is, and have some intuition of why it is important

know how to summarise the Gaussian distribution using mean and variance

know how to maximise a likelihood function

be able to do simple inference in both classical and Bayesian ways

(Optional) understand how Bayes Net can be used to model causal relationships

# Setup#

## Install and import feedback gadget#

## Show code cell source

```
# @title Install and import feedback gadget
!pip3 install vibecheck datatops --quiet
from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
return DatatopsContentReviewContainer(
"", # No text prompt
notebook_section,
{
"url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
"name": "neuromatch-precourse",
"user_key": "8zxfvwxw",
},
).render()
feedback_prefix = "W0D5_T2"
```

```
# Imports
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
from scipy.stats import norm
from numpy.random import default_rng # a default random number generator
```

## Figure settings#

## Show code cell source

```
# @title Figure settings
import logging
logging.getLogger('matplotlib.font_manager').disabled = True
import ipywidgets as widgets # interactive display
from ipywidgets import interact, fixed, HBox, Layout, VBox, interactive, Label, interact_manual
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/content-creation/main/nma.mplstyle")
```

## Plotting functions#

## Show code cell source

```
# @title Plotting functions
def plot_hist(data, xlabel, figtitle = None, num_bins = None):
""" Plot the given data as a histogram.
Args:
data (ndarray): array with data to plot as histogram
xlabel (str): label of x-axis
figtitle (str): title of histogram plot (default is no title)
num_bins (int): number of bins for histogram (default is 10)
Returns:
count (ndarray): number of samples in each histogram bin
bins (ndarray): center of each histogram bin
"""
fig, ax = plt.subplots()
ax.set_xlabel(xlabel)
ax.set_ylabel('Count')
if num_bins is not None:
count, bins, _ = plt.hist(data, max(data), bins=num_bins)
else:
count, bins, _ = plt.hist(data, max(data)) # 10 bins default
if figtitle is not None:
fig.suptitle(figtitle, size=16)
plt.show()
return count, bins
def plot_gaussian_samples_true(samples, xspace, mu, sigma, xlabel, ylabel):
""" Plot a histogram of the data samples on the same plot as the gaussian
distribution specified by the give mu and sigma values.
Args:
samples (ndarray): data samples for gaussian distribution
xspace (ndarray): x values to sample from normal distribution
mu (scalar): mean parameter of normal distribution
sigma (scalar): variance parameter of normal distribution
xlabel (str): the label of the x-axis of the histogram
ylabel (str): the label of the y-axis of the histogram
Returns:
Nothing.
"""
fig, ax = plt.subplots()
ax.set_xlabel(xlabel)
ax.set_ylabel(ylabel)
# num_samples = samples.shape[0]
count, bins, _ = plt.hist(samples, density=True) # probability density function
plt.plot(xspace, norm.pdf(xspace, mu, sigma), 'r-')
plt.show()
def plot_likelihoods(likelihoods, mean_vals, variance_vals):
""" Plot the likelihood values on a heatmap plot where the x and y axes match
the mean and variance parameter values the likelihoods were computed for.
Args:
likelihoods (ndarray): array of computed likelihood values
mean_vals (ndarray): array of mean parameter values for which the
likelihood was computed
variance_vals (ndarray): array of variance parameter values for which the
likelihood was computed
Returns:
Nothing.
"""
fig, ax = plt.subplots()
im = ax.imshow(likelihoods)
cbar = ax.figure.colorbar(im, ax=ax)
cbar.ax.set_ylabel('log likelihood', rotation=-90, va="bottom")
ax.set_xticks(np.arange(len(mean_vals)))
ax.set_yticks(np.arange(len(variance_vals)))
ax.set_xticklabels(mean_vals)
ax.set_yticklabels(variance_vals)
ax.set_xlabel('Mean')
ax.set_ylabel('Variance')
plt.show()
def posterior_plot(x, likelihood=None, prior=None,
posterior_pointwise=None, ax=None):
"""
Plots normalized Gaussian distributions and posterior.
Args:
x (numpy array of floats): points at which the likelihood has been evaluated
auditory (numpy array of floats): normalized probabilities for auditory likelihood evaluated at each `x`
visual (numpy array of floats): normalized probabilities for visual likelihood evaluated at each `x`
posterior (numpy array of floats): normalized probabilities for the posterior evaluated at each `x`
ax: Axis in which to plot. If None, create new axis.
Returns:
Nothing.
"""
if likelihood is None:
likelihood = np.zeros_like(x)
if prior is None:
prior = np.zeros_like(x)
if posterior_pointwise is None:
posterior_pointwise = np.zeros_like(x)
if ax is None:
fig, ax = plt.subplots()
ax.plot(x, likelihood, '-C1', linewidth=2, label='Auditory')
ax.plot(x, prior, '-C0', linewidth=2, label='Visual')
ax.plot(x, posterior_pointwise, '-C2', linewidth=2, label='Posterior')
ax.legend()
ax.set_ylabel('Probability')
ax.set_xlabel('Orientation (Degrees)')
plt.show()
return ax
def plot_classical_vs_bayesian_normal(num_points, mu_classic, var_classic,
mu_bayes, var_bayes):
""" Helper function to plot optimal normal distribution parameters for varying
observed sample sizes using both classic and Bayesian inference methods.
Args:
num_points (int): max observed sample size to perform inference with
mu_classic (ndarray): estimated mean parameter for each observed sample size
using classic inference method
var_classic (ndarray): estimated variance parameter for each observed sample size
using classic inference method
mu_bayes (ndarray): estimated mean parameter for each observed sample size
using Bayesian inference method
var_bayes (ndarray): estimated variance parameter for each observed sample size
using Bayesian inference method
Returns:
Nothing.
"""
xspace = np.linspace(0, num_points, num_points)
fig, ax = plt.subplots()
ax.set_xlabel('n data points')
ax.set_ylabel('mu')
plt.plot(xspace, mu_classic,'r-', label="Classical")
plt.plot(xspace, mu_bayes,'b-', label="Bayes")
plt.legend()
plt.show()
fig, ax = plt.subplots()
ax.set_xlabel('n data points')
ax.set_ylabel('sigma^2')
plt.plot(xspace, var_classic,'r-', label="Classical")
plt.plot(xspace, var_bayes,'b-', label="Bayes")
plt.legend()
plt.show()
```

# Section 1: Basic probability#

## Section 1.1: Basic probability theory#

### Video 1: Basic Probability#

### Submit your feedback#

## Show code cell source

```
# @title Submit your feedback
content_review(f"{feedback_prefix}_Basic_Probability_Video")
```

This video covers basic probability theory, including complementary probability, conditional probability, joint probability, and marginalisation.

## Click here for text recap of video

Previously we were only looking at sampling or properties of a single variables, but as we will now move on to statistical inference, it is useful to go over basic probability theory.

As a reminder, probability has to be in the range 0 to 1 \(P(A) \in [0,1] \)

and the complementary can always be defined as

\(P(\neg A) = 1-P(A)\)

When we have two variables, the *conditional probability* of \(A\) given \(B\) is

\(P (A|B) = P (A \cap B)/P (B)=P (A, B)/P (B)\)

while the *joint probability* of \(A\) and \(B\) is

\(P(A \cap B)=P(A,B) = P(B|A)P(A) = P(A|B)P(B) \)

We can then also define the process of *marginalisation* (for discrete variables) as

\(P(A)=\sum P(A,B)=\sum P(A|B)P(B)\)

where the summation is over the possible values of \(B\).

As an example if \(B\) is a binary variable that can take values \(B+\) or \(B0\) then \(P(A)=\sum P(A,B)=P(A|B+)P(B+)+ P(A|B0)P(B0) \).

For continuous variables marginalization is given as \(P(A)=\int P(A,B) dB=\int P(A|B)P(B) dB\)

### Math Exercise 1.1: Probability example#

To remind ourselves of how to use basic probability theory we will do a short exercise (no coding needed!), based on measurement of binary probabilistic neural responses. As shown by Hubel and Wiesel in 1959 there are neurons in primary visual cortex that respond to different orientations of visual stimuli, with different neurons being sensitive to different orientations. The numbers in the following are however purely fictional.

Imagine that your collaborator tells you that they have recorded the activity of visual neurons while presenting either a horizontal or vertical grid as a visual stimulus. The activity of the neurons is measured as binary: they are either active or inactive in response to the stimulus.

After recording from a large number of neurons they find that when presenting a horizontal grid, on average 40% of neurons are active, while 30% respond to vertical grids.

We will use the following notation to indicate the probability that a randomly chosen neuron responds to horizontal grids

and this to show the probability it responds to vertical:

We can find the complementary event, that the neuron does not respond to the horizontal grid, using the fact that these events must add up to 1. We see that the probability the neuron does not respond to the horizontal grid (\(h_0\)) is

and that the probability to not respond to vertical is

We will practice computing various probabilities in this framework.

### A) Product#

Assuming that the horizontal and vertical orientation selectivity are independent, what is the probability that a randomly chosen neuron is sensitive to both horizontal and vertical orientations?

Hint: Two events are independent if the outcome of one does not affect the outcome of the other.

### B) Joint probability generally#

A collaborator informs you that actually these are not independent. Of those neurons that respond to vertical, only 10 percent also respond to horizontal, i.e. the probability of responding to horizonal *given* that it responds to vertical is \(P(h+|v+)=0.1\)

Given this new information, what is now the probability that a randomly chosen neuron is sensitive to both horizontal and vertical orientations?

### C) Conditional probability#

You start measuring from a neuron and find that it responds to horizontal orientations. What is now the probability that it also responds to vertical, i.e., \(P(v_+|h_+)\))?

### D) Marginal probability#

Lastly, let’s check that everything has been done correctly. Given our knowledge about the conditional probabilities, we should be able to use marginalisation to recover the marginal probability of a random neuron responding to vertical orientations, i.e.,\(P(v_+)\). We know from above that this should equal 0.3.

Calculate \(P(v_+)\) based on the conditional probabilities for \(P(v_+|h_+)\) and \(P(v_+|h_0)\) (the latter which you will need to calculate).

#### Submit your feedback#

## Show code cell source

```
# @title Submit your feedback
content_review(f"{feedback_prefix}_Probability_Example_Main_Exercise")
```

## Section 1.2: Markov chains#

### Video 2: Markov Chains#

### Submit your feedback#

## Show code cell source

```
# @title Submit your feedback
content_review(f"{feedback_prefix}_Markov_Chains_Video")
```

### Coding exercise 1.2 Markov chains#

We will practice more probability theory by looking at **Markov chains**. The Markov property specifies that you can fully encapsulate the important properties of a system based on its *current* state at the current time, any previous history does not matter. It is memoryless.

As an example imagine that a rat is able to move freely between 3 areas: a dark rest area
(\(state=1\)), a nesting area (\(state=2\)) and a bright area for collecting food (\(state=3\)). Every 5 minutes (timepoint \(i\)) we record the rat’s location. We can use a **categorical distribution** to look at the probability that the rat moves to one state from another.

The table below shows the probability of the rat transitioning from one area to another between timepoints (\(state_i\) to \(state_{i+1}\)).

We are modeling this as a Markov chain, so the animal is only in one of the states at a time and can transition between the states.

We want to get the probability of each state at time \(i+1\). We know from Section 1.1 that we can use marginalisation:

Let’s say we had a row vector (a vector defined as a row, not a column so matrix multiplication will work out) of the probabilities of the current state:

If we actually know where the rat is at the current time point, this would be deterministic (e.g., \(P_i = [0, 1, 0]\) if the rat is in state 2). Otherwise, this could be probabilistic (e.g. \(P_i = [0.1, 0.7, 0.2]\)).

To compute the vector of probabilities of the state at the time \(i+1\), we can use linear algebra and multiply our vector of the probabilities of the current state with the transition matrix. Recall your matrix multiplication skills from W0D3 to check this!

where \(T\) is our transition matrix.

This is the same formula for every step, which allows us to get the probabilities for a time more than 1 step in advance easily. If we started at \(i=0\) and wanted to look at the probabilities at step \(i=2\), we could do:

So, every time we take a further step we can just multiply with the transition matrix again. So, the probability vector of states at j timepoints after the current state at timepoint i is equal to the probability vector at timepoint i times the transition matrix raised to the jth power.

If the animal starts in area 2, what is the probability the animal will again be in area 2 when we check on it 20 minutes (4 transitions) later?

Fill in the transition matrix in the code below.

```
###################################################################
## TODO for student
## Fill out the following then remove
raise NotImplementedError("Student exercise: compute state probabilities after 4 transitions")
###################################################################
# Transition matrix
transition_matrix = np.array([[ 0.2, 0.6, 0.2], [ .6, 0.3, 0.1], [0.8, 0.2, 0]])
# Initial state, p0
p0 = np.array([0, 1, 0])
# Compute the probabilities 4 transitions later (use np.linalg.matrix_power to raise a matrix a power)
p4 = ...
# The second area is indexed as 1 (Python starts indexing at 0)
print(f"The probability the rat will be in area 2 after 4 transitions is: {p4[1]}")
```

You should get a probability of `0.4311`

, i.e., there is a 43.11% chance that you will find the rat in area 2 in 20 minutes.

What is the average amount of time spent by the rat in each of the states?

Implicit in the question is the idea that we can start off with a random initial state and then measure how much relative time is spent in each area. If we make a few assumptions (e.g. ergodic or ‘randomly mixing’ system), we can instead start with an initial random distribution and see how the final probabilities of each state after many time steps (100) to estimate the time spent in each state.

```
# Initialize random initial distribution
p_random = np.ones((1, 3))/3
###################################################################
## TODO for student: Fill compute the state matrix after 100 transitions
raise NotImplementedError("Student exercise: need to complete computation below")
###################################################################
# Fill in the missing line to get the state matrix after 100 transitions, like above
p_average_time_spent = ...
print(f"The proportion of time spend by the rat in each of the three states is: {p_average_time_spent[0]}")
```

The proportion of time spend in each of the three areas are `0.4473`

, `0.4211`

, and `0.1316`

, respectively.

### Submit your feedback#

## Show code cell source

```
# @title Submit your feedback
content_review(f"{feedback_prefix}_Markov_Chains_Exercise")
```

Imagine now that if the animal is satiated and tired the transitions change to:

Try repeating the questions above for this table of transitions by changing the transition matrix. Based on the probability values, what would you predict? Check how much time the rat spends on average in each area and see if it matches your predictions.

**Main course preview:** The Markov property is extremely important for many models, particularly Hidden Markov Models, discussed on hidden dynamics day, and for methods such as Markov Chain Monte Carlo sampling.

# Section 2: Statistical inference and likelihood#

## Section 2.1: Likelihoods#

### Video 3: Statistical inference and likelihood#

### Submit your feedback#

## Show code cell source

```
# @title Submit your feedback
content_review(f"{feedback_prefix}_Statistical_inference_and_likelihood_Video")
```

**Correction to video**: The variance estimate that maximizes the likelihood is \(\bar{\sigma}^2=\frac{1}{n} \sum_i (x_i-\bar{x})^2 \). This is a biased estimate. Shown in the video is the sample variance, which is an unbiased estimate for variance: \(\bar{\sigma}^2=\frac{1}{n-1} \sum_i (x_i-\bar{x})^2 \). See section 2.2.3 for more details.

## Click here for text recap of video

A generative model (such as the Gaussian distribution from the previous tutorial) allows us to make predictions about outcomes.

However, after we observe \(n\) data points, we can also evaluate our model (and any of its associated parameters) by calculating the **likelihood** of our model having generated each of those data points \(x_i\).

For all data points \(\mathbf{x}=(x_1, x_2, x_3, ...x_n) \) we can then calculate the likelihood for the whole dataset by computing the product of the likelihood for each single data point.

While the likelihood may be written as a conditional probability (\(P(x|\mu,\sigma)\)), we refer to it as the **likelihood function**, \(L(\mu,\sigma)\). This slight switch in notation is to emphasize our focus: we use likelihood functions when the data points \(\mathbf{x}\) are fixed and we are focused on the parameters.

Our new notation makes clear that the likelihood \(L(\mu,\sigma)\) is a function of \(\mu\) and \(\sigma\), not of \(\mathbf{x}\).

In the last tutorial we reviewed how the data was generated given the selected parameters of the generative process. If we do not know the parameters \(\mu\), \(\sigma\) that generated the data, we can try to **infer** which parameter values (given our model) gives the best (highest) likelihood. This is what we call statistical inference: trying to infer what parameters make our observed data the most likely or probable?

### Coding Exercise 2.1: Computing likelihood#

Let’s start with computing the likelihood of some set of data points being drawn from a Gaussian distribution with a mean and variance we choose.

As multiplying small probabilities together can lead to very small numbers, it is often convenient to report the *logarithm* of the likelihood. This is just a convenient transformation and as logarithm is a monotonically increasing function this does not change what parameters maximise the function.

```
def compute_likelihood_normal(x, mean_val, standard_dev_val):
""" Computes the log-likelihood values given a observed data sample x, and
potential mean and variance values for a normal distribution
Args:
x (ndarray): 1-D array with all the observed data
mean_val (scalar): value of mean for which to compute likelihood
standard_dev_val (scalar): value of variance for which to compute likelihood
Returns:
likelihood (scalar): value of likelihood for this combination of means/variances
"""
###################################################################
## TODO for student
raise NotImplementedError("Student exercise: compute likelihood")
###################################################################
# Get probability of each data point (use norm.pdf from scipy stats)
p_data = ...
# Compute likelihood (sum over the log of the probabilities)
likelihood = ...
return likelihood
# Set random seed
np.random.seed(0)
# Generate data
true_mean = 5
true_standard_dev = 1
n_samples = 1000
x = np.random.normal(true_mean, true_standard_dev, size = (n_samples,))
# Compute likelihood for a guessed mean/standard dev
guess_mean = 4
guess_standard_dev = .1
likelihood = compute_likelihood_normal(x, guess_mean, guess_standard_dev)
print(likelihood)
```

You should get a likelihood of `-92904.81`

.

This is somewhat meaningless to us! For it to be useful, we need to compare it to the likelihoods computing using other guesses of the mean or standard deviation. The visualization below shows us the likelihood for various values of the mean and the standard deviation. Essentially, we are performing a rough grid-search over means and standard deviations. What would you guess as the true mean and standard deviation based on this visualization?

Execute to visualize likelihoods

## Show code cell source

```
# @markdown Execute to visualize likelihoods
# Set random seed
np.random.seed(0)
# Generate data
true_mean = 5
true_standard_dev = 1
n_samples = 1000
x = np.random.normal(true_mean, true_standard_dev, size = (n_samples,))
# Compute likelihood for different mean/variance values
mean_vals = np.linspace(1, 10, 10) # potential mean values to ry
standard_dev_vals = np.array([0.7, 0.8, 0.9, 1, 1.2, 1.5, 2, 3, 4, 5]) # potential variance values to try
# Initialise likelihood collection array
likelihood = np.zeros((mean_vals.shape[0], standard_dev_vals.shape[0]))
# Compute the likelihood for observing the gvien data x assuming
# each combination of mean and variance values
for idxMean in range(mean_vals.shape[0]):
for idxVar in range(standard_dev_vals .shape[0]):
likelihood[idxVar,idxMean]= sum(np.log(norm.pdf(x, mean_vals[idxMean],
standard_dev_vals[idxVar])))
# Uncomment once you've generated the samples and compute likelihoods
xspace = np.linspace(0, 10, 100)
plot_likelihoods(likelihood, mean_vals, standard_dev_vals)
```

#### Submit your feedback#

## Show code cell source

```
# @title Submit your feedback
content_review(f"{feedback_prefix}_Computing_likelihood_Exercise")
```

## Section 2.2: Maximum likelihood#

### Video 4: Maximum likelihood#

### Submit your feedback#

## Show code cell source

```
# @title Submit your feedback
content_review(f"{feedback_prefix}_Maximum_likelihood_Video")
```

Implicitly, by looking for the parameters that give the highest likelihood in the last section, we have been searching for the **maximum likelihood** estimate.

In next sections, we will look at other ways of inferring such parameter variables.

### Section 2.2.1: Searching for best parameters#

We want to do inference on this data set, i.e. we want to infer the parameters that most likely gave rise to the data given our model. Intuitively that means that we want as good as possible a fit between the observed data and the probability distribution function with the best inferred parameters. We can search for the best parameters manually by trying out a bunch of possible values of the parameters, computing the likelihoods, and picking the parameters that resulted in the highest likelihood.

#### Interactive Demo 2.2: Maximum likelihood inference#

Try to see how well you can fit the probability distribution to the data by using the demo sliders to control the mean and standard deviation parameters of the distribution. We will visualize the histogram of data points (in blue) and the Gaussian density curve with that mean and standard deviation (in red). Below, we print the log-likelihood.

What (approximate) values of mu and sigma result in the best fit?

How does the value below the plot (the log-likelihood) change with the quality of fit?

Make sure you execute this cell to enable the widget and fit by hand!

## Show code cell source

```
# @markdown Make sure you execute this cell to enable the widget and fit by hand!
# Generate data
true_mean = 5
true_standard_dev = 1
n_samples = 1000
vals = np.random.normal(true_mean, true_standard_dev, size = (n_samples,))
def plotFnc(mu,sigma):
loglikelihood= sum(np.log(norm.pdf(vals,mu,sigma)))
#calculate histogram
#prepare to plot
fig, ax = plt.subplots()
ax.set_xlabel('x')
ax.set_ylabel('probability')
#plot histogram
count, bins, ignored = plt.hist(vals,density=True)
x = np.linspace(0,10,100)
#plot pdf
plt.plot(x, norm.pdf(x,mu,sigma),'r-')
plt.show()
print("The log-likelihood for the selected parameters is: " + str(loglikelihood))
#interact(plotFnc, mu=5.0, sigma=2.1);
#interact(plotFnc, mu=widgets.IntSlider(min=0.0, max=10.0, step=1, value=4.0),sigma=widgets.IntSlider(min=0.1, max=10.0, step=1, value=4.0));
interact(plotFnc, mu=(0.0, 15.0, 0.1),sigma=(0.1, 5.0, 0.1));
```

Doing this was similar to the grid searched image from Section 2.1. Really, we want to see if we can do inference on observed data in a bit more principled way.

##### Submit your feedback#

## Show code cell source

```
# @title Submit your feedback
content_review(f"{feedback_prefix}_Maximum_likelihood_inference_Interactive_Demo")
```