Licensed under the Creative Commons Attribution-NonCommercial license. Please share and remix non-commercially, mentioning its origin.

Produced with R version 3.5.1 and pomp version


Students completing this lesson will:

  1. Gain an understanding of the nature of the problem of likelihood computation for POMP models.
  2. Be able to explain the simplest particle filter algorithm.
  3. Gain experience in the visualization and exploration of likelihood surfaces.
  4. Be able to explain the tools of likelihood-based statistical inference that become available given numerical accessibility of the likelihood function.


The likelihood function

Definition of the likelihood function

  • Data are a sequence of \(N\) observations, denoted \(y_{1:N}^*\).

  • A statistical model is a density function \(f_{Y_{1:N}}(y_{1:N};\theta)\) which defines a probability distribution for each value of a parameter vector \(\theta\).

  • To perform statistical inference, we must decide, among other things, for which (if any) values of \(\theta\) it is reasonable to model \(y^*_{1:N}\) as a random draw from \(f_{Y_{1:N}}(y_{1:N};\theta)\).

  • The likelihood function is \[{\mathcal{L}}(\theta) = f_{Y_{1:N}}(y^*_{1:N};\theta),\] the density function evaluated at the data.

  • It is often convenient to work with the log likelihood function, \[{\ell}(\theta)= \log {\mathcal{L}}(\theta) = \log f_{Y_{1:N}}(y^*_{1:N};\theta).\]
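To make these definitions concrete, here is a minimal sketch in R computing \({\ell}(\theta)\) for an iid normal model; the data and parameter values are made up for illustration and have nothing to do with the POMP models considered later.

```r
## Log likelihood of an iid Normal(mu,sigma) model evaluated at data y:
## l(theta) = log f_{Y_{1:N}}(y*_{1:N}; theta) = sum_n log f(y*_n; theta).
loglik_normal <- function(theta, y) {
  mu <- theta[1]; sigma <- theta[2]
  sum(dnorm(y, mean=mu, sd=sigma, log=TRUE))
}

set.seed(1)
y <- rnorm(100, mean=2, sd=1)   # simulated "data"
loglik_normal(c(2,1), y)        # log likelihood at the true parameters
loglik_normal(c(0,1), y)        # a poor parameter value scores much lower
```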

Modeling using discrete and continuous distributions

  • Recall that the probability distribution \(f_{Y_{1:N}}(y_{1:N};\theta)\) defines a random variable \(Y_{1:N}\) for which probabilities can be computed as integrals of \(f_{Y_{1:N}}(y_{1:N};\theta)\).

  • Specifically, for any event \(E\) describing a set of possible outcomes of \(Y_{1:N}\), \[{\mathbb{P}\left[{Y_{1:N} \in E}\right]} = \int_E f_{Y_{1:N}}(y_{1:N};\theta)\, dy_{1:N}.\]

  • If the model corresponds to a discrete distribution, then the integral is replaced by a sum and the probability density function is called a probability mass function.

  • The definition of the likelihood function remains unchanged. We will use the notation of continuous random variables, but all the methods apply also to discrete models.

Review of likelihood-based inference

For now, suppose that software exists to evaluate and maximize the likelihood function, up to a tolerable numerical error, for the dynamic models of interest. Our immediate task is to think about how to use that capability.

The maximum likelihood estimate (MLE)

  • A maximum likelihood estimate (MLE) is \[ \hat\theta = \underset{\theta}{\arg\max} {\ell}(\theta),\] where \(\underset{\theta}{\arg\max} g(\theta)\) means a value of argument \(\theta\) at which the maximum of the function \(g\) is attained, so \(g\big(\underset{\theta}{\arg\max} g(\theta)\big) = \max_\theta g(\theta)\).

  • If there are many values of \(\theta\) giving the same maximum value of the likelihood, then an MLE still exists but is not unique.

  • Note that \(\underset{\theta}{\arg\max} {\mathcal{L}}(\theta)\) and \(\underset{\theta}{\arg\max} {\ell}(\theta)\) are the same. Why?
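The answer to the question above: \(\log\) is strictly increasing, so \({\mathcal{L}}\) and \({\ell}=\log{\mathcal{L}}\) attain their maxima at the same \(\theta\). A quick numerical check on a toy binomial likelihood (the counts here are made up):

```r
## Likelihood and log likelihood for x successes in n Bernoulli trials.
x <- 7; n <- 10
lik    <- function(p) dbinom(x, n, p)
loglik <- function(p) dbinom(x, n, p, log=TRUE)

## Both maximizations land on the analytical MLE x/n = 0.7.
optimize(lik,    interval=c(0.01,0.99), maximum=TRUE)$maximum
optimize(loglik, interval=c(0.01,0.99), maximum=TRUE)$maximum
```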

Standard errors for the MLE

  • Of course, we have a responsibility to attach a measure of uncertainty to our parameter estimates!

  • Usually, this means obtaining a confidence interval. Strictly speaking, what we obtain is usually an interval close to a true confidence interval, which should formally be called an approximate confidence interval; in practice, the word “approximate” is often dropped!

  • There are three main approaches to estimating the statistical uncertainty in an MLE.

  1. The Fisher information.

    • A computationally quick approach when one has access to satisfactory numerical second derivatives of the log likelihood.

    • The approximation is satisfactory only when \(\hat\theta\) is well approximated by a normal distribution.

    • Neither of the two requirements above is typically met for POMP models.

    • A review of standard errors via Fisher information is provided as a supplement.

  2. Profile likelihood estimation. This approach is generally preferable to the Fisher information for POMP models.

  3. A simulation study, also known as a bootstrap.

    • If done carefully and well, this can be the best approach.

    • A confidence interval is a claim about reproducibility. You claim, so far as your model is correct, that on 95% of realizations from the model, a 95% confidence interval you have constructed will cover the true value of the parameter.

    • A simulation study can check this claim fairly directly, but requires the most effort.

    • The simulation study takes time for you to develop and debug, time for you to explain, and time for the reader to understand and check what you have done. We usually carry out simulation studies to check our main conclusions only.

    • Further discussion of bootstrap methods for POMP models is provided as a supplement.

Confidence intervals via the profile likelihood

  • Let’s consider the problem of obtaining a confidence interval for the first component of \(\theta\). We’ll write \[\theta=(\phi,\psi).\]

  • The profile log likelihood function of \(\phi\) is defined to be \[ {\ell_\mathrm{profile}}(\phi) = \max_{\psi}{\ell}(\phi,\psi).\] In general, the profile likelihood of one parameter is constructed by maximizing the likelihood function over all other parameters.

  • Note that, \(\max_{\phi}{\ell_\mathrm{profile}}(\phi) = \max_{\theta}{\ell}(\theta)\) and that maximizing the profile likelihood \({\ell_\mathrm{profile}}(\phi)\) gives the MLE, \(\hat{\theta}\). Why?

  • An approximate 95% confidence interval for \(\phi\) is given by \[ \big\{\phi : {\ell}(\hat\theta) - {\ell_\mathrm{profile}}(\phi) < 1.92\big\}.\]

  • This is known as a profile likelihood confidence interval. The cutoff \(1.92\) is derived using Wilks’s theorem, which we will discuss in more detail when we develop likelihood ratio tests.

  • Although the asymptotic justification of Wilks’s theorem is the same limit that justifies the Fisher information standard errors, profile likelihood confidence intervals tend to work better than Fisher information confidence intervals when \(N\) is not so large—particularly when the log likelihood function is not close to quadratic near its maximum.
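As a concrete sketch of the recipe above, here is a profile likelihood confidence interval for the mean \(\mu\) of an iid \(\mathrm{Normal}(\mu,\sigma)\) model, profiling out \(\sigma\); the data are simulated, and the closed-form profile used here is special to this toy model.

```r
set.seed(2)
y <- rnorm(50, mean=1, sd=2)
n <- length(y)

## For fixed mu, the likelihood is maximized at sigma^2 = mean((y-mu)^2),
## which gives the profile log likelihood in closed form.
profile_ll <- function(mu) {
  s2 <- mean((y - mu)^2)
  -n/2*log(2*pi*s2) - n/2
}

mu_grid <- seq(-1, 3, length=400)
pll <- sapply(mu_grid, profile_ll)

## Approximate 95% CI: grid points within 1.92 log units of the maximum.
ci <- range(mu_grid[max(pll) - pll < 1.92])
ci
```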

Likelihood-based model selection and model diagnostics

  • For nested hypotheses, we can carry out model selection by likelihood ratio tests.

  • For non-nested hypotheses, likelihoods can be compared using Akaike’s information criterion (AIC) or related methods.

Likelihood ratio tests for nested hypotheses

  • The whole parameter space on which the model is defined is \(\Theta\subset{\mathbb{R}}^D\).

  • Suppose we have two nested hypotheses \[\begin{eqnarray} H^{\langle 0\rangle} &:& \theta\in \Theta^{\langle 0\rangle}, \\ H^{\langle 1\rangle} &:& \theta\in \Theta^{\langle 1\rangle}, \end{eqnarray}\] defined via two nested parameter subspaces, \(\Theta^{\langle 0\rangle}\subset \Theta^{\langle 1\rangle}\), with respective dimensions \(D^{\langle 0\rangle}< D^{\langle 1\rangle}\le D\).

  • We consider the log likelihood maximized over each of the hypotheses, \[\begin{eqnarray} \ell^{\langle 0\rangle} &=& \sup_{\theta\in \Theta^{\langle 0\rangle}} \ell(\theta), \\ \ell^{\langle 1\rangle} &=& \sup_{\theta\in \Theta^{\langle 1\rangle}} \ell(\theta). \end{eqnarray}\]

  • A useful approximation asserts that, under the hypothesis \(H^{\langle 0\rangle}\), \[ \ell^{\langle 1\rangle} - \ell^{\langle 0\rangle} \approx (1/2) \chi^2_{D^{\langle 1\rangle}- D^{\langle 0\rangle}}, \] where \(\chi^2_d\) is a chi-squared random variable on \(d\) degrees of freedom and \(\approx\) means “is approximately distributed as.”

  • We will call this the Wilks approximation.

  • The Wilks approximation can be used to construct a hypothesis test of the null hypothesis \(H^{\langle 0\rangle}\) against the alternative \(H^{\langle 1\rangle}\).

  • This is called a likelihood ratio test since a difference of log likelihoods corresponds to a ratio of likelihoods.

  • When the data are IID, \(N\to\infty\), and the hypotheses satisfy suitable regularity conditions, this approximation can be derived mathematically and is known as Wilks’s theorem.

  • The chi-squared approximation to the likelihood ratio statistic may be useful, and can be assessed empirically by a simulation study, even in situations that do not formally satisfy any known theorem.
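A minimal worked example of the Wilks approximation, using a toy iid normal model (simulated data, not a POMP model): we test \(H^{\langle 0\rangle}:\mu=0\) against \(H^{\langle 1\rangle}:\mu\) unconstrained, with \(\sigma\) estimated under both hypotheses, so \(D^{\langle 1\rangle}-D^{\langle 0\rangle}=1\).

```r
set.seed(3)
y <- rnorm(40, mean=0.2, sd=1)

ll <- function(mu, s2) sum(dnorm(y, mu, sqrt(s2), log=TRUE))
l1 <- ll(mean(y), mean((y - mean(y))^2))  # maximized under H1
l0 <- ll(0,       mean(y^2))              # maximized under H0 (mu = 0)

## Under H0, 2*(l1 - l0) is approximately chi-squared with 1 df.
lrt  <- 2*(l1 - l0)
pval <- pchisq(lrt, df=1, lower.tail=FALSE)
c(statistic=lrt, p.value=pval)
```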

The connection between Wilks’s theorem and profile likelihood

  • Suppose we have an MLE, written \(\hat\theta=(\hat\phi,\hat\psi)\), and a profile log likelihood for \(\phi\), given by \({\ell_\mathrm{profile}}(\phi)\).

  • Consider the likelihood ratio test for the nested hypotheses \[\begin{eqnarray} H^{\langle 0\rangle} &:& \phi = \phi_0, \\ H^{\langle 1\rangle} &:& \mbox{$\phi$ unconstrained}. \end{eqnarray}\]

  • We can check what the 95% cutoff is for a chi-squared distribution with one degree of freedom:

qchisq(p=0.95,df=1)
## [1] 3.841459
  • Wilks’s theorem then gives us a hypothesis test with approximate size \(5\%\) that rejects \(H^{\langle 0\rangle}\) if \({\ell_\mathrm{profile}}(\hat\phi)-{\ell_\mathrm{profile}}(\phi_0)>3.84/2\).

  • It follows that, with probability \(95\%\), the true value of \(\phi\) falls in the set \[\big\{\phi: {\ell_\mathrm{profile}}(\hat\phi)-{\ell_\mathrm{profile}}(\phi)<1.92\big\}.\] So, we have constructed a profile likelihood confidence interval, consisting of the set of points on the profile likelihood within 1.92 log units of the maximum.

  • This is an example of a general duality between confidence intervals and hypothesis tests.

Akaike’s information criterion (AIC)

  • Likelihood ratio tests provide an approach to model selection for nested hypotheses, but what do we do when models are not nested?

  • A more general approach is to compare likelihoods of different models by penalizing the likelihood of each model by a measure of its complexity.

  • Akaike’s information criterion (AIC) is given by \[ \textrm{AIC} = -2\,{\ell}(\hat{\theta}) + 2\,D,\] i.e., “minus twice the maximized log likelihood plus twice the number of parameters,” where \(D\) is the number of estimated parameters.

  • We are invited to select the model with the lowest AIC score.

  • AIC was derived as an approach to minimizing prediction error: adding parameters improves the fit to the data at hand, but the extra flexibility leads to overfitting, which can decrease the predictive skill of the fitted model.

  • Viewed as a hypothesis test, AIC may have weak statistical properties. It can be a mistake to interpret AIC by making a claim that the favored model has been shown to provide a superior explanation of the data. However, viewed as a way to select a model with reasonable predictive skill from a range of possibilities, it is often useful.

  • AIC does not penalize model complexity beyond the consequence of reduced predictive skill due to overfitting. One can penalize complexity by incorporating a more severe penalty than the \(2D\) term above, such as via BIC.

  • A practical approach is to use AIC, while taking care to view it as a procedure to select a reasonable predictive model and not as a formal hypothesis test.
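To see the AIC formula in action, here is a hand computation for an iid normal model, checked against R’s built-in AIC() on the equivalent intercept-only linear model; the data are simulated purely for illustration.

```r
set.seed(4)
y <- rnorm(30)
n <- length(y)

## Maximized log likelihood: MLEs are mean(y) and sigma^2 = mean((y-mean(y))^2).
s2_mle <- mean((y - mean(y))^2)
ll_max <- sum(dnorm(y, mean(y), sqrt(s2_mle), log=TRUE))

aic_hand <- -2*ll_max + 2*2   # D = 2 estimated parameters (mu and sigma)
aic_hand
AIC(lm(y ~ 1))                # the same model, via R's AIC
```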

Indirect specification of a statistical model via a simulation procedure

The likelihood for a POMP model

\[\begin{eqnarray} &&f_{X_{0:N},Y_{1:N}}(x_{0:N},y_{1:N};\theta)\\ && \quad = f_{X_0}(x_0;\theta)\,\prod_{n=1}^N\!f_{X_n | X_{n-1}}(x_n|x_{n-1};\theta)\,f_{Y_n|X_n}(y_n|x_n;\theta). \end{eqnarray}\]

Special case: deterministic latent process

  • When the latent process is deterministic, the log likelihood for a POMP model closely resembles that of a nonlinear regression model.

  • In this case, we can write \(X_{n}=x_n(\theta)\), and the log likelihood is \[{\ell}(\theta)= \sum_{n=1}^N \log f_{Y_n|X_n}\big(y_n^*| x_n(\theta); \theta\big).\]

  • If we have a Gaussian measurement model, where \(Y_n\) given \(X_n=x_n(\theta)\) is conditionally normal with mean \(\hat{y}_n\big(x_n(\theta)\big)\) and constant variance \(\sigma^2\), then the log likelihood contains a sum of squares which is exactly the criterion that nonlinear least squares regression seeks to minimize.

  • More details on deterministic latent process models are given as a supplement.
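The claimed connection to least squares is easy to verify numerically: with \(\sigma\) known and constant, \({\ell}(\theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_n \big(y^*_n - \hat{y}_n\big)^2\), so maximizing \(\ell\) over \(\theta\) is exactly minimizing the sum of squares. A check with made-up numbers:

```r
sigma <- 0.5
ystar <- c(1.2, 0.8, 1.5)   # "data"
yhat  <- c(1.0, 1.0, 1.4)   # model predictions at some theta
N <- length(ystar)

SS <- sum((ystar - yhat)^2)
ll_direct  <- sum(dnorm(ystar, mean=yhat, sd=sigma, log=TRUE))
ll_formula <- -N/2*log(2*pi*sigma^2) - SS/(2*sigma^2)
c(ll_direct, ll_formula)    # the two computations agree
```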

General case: stochastic unobserved state process

  • For a POMP model, the likelihood takes the form of an integral:

\[\begin{eqnarray} {\mathcal{L}}(\theta) &=& f_{Y_{1:N}}({y^*_{1:N}};\theta) \\ &=& \! \int_{x_{0:N}} \!\! f_{X_0}(x_0;\theta)\prod_{n=1}^{N}\!f_{Y_n|X_n}({y^*_n}| x_n; \theta)\, f_{X_n|X_{n-1}}(x_n|x_{n-1};\theta)\, dx_{0:N}. \tag{L1} \end{eqnarray}\]

  • This integral is high dimensional and, except for the simplest cases, cannot be reduced analytically.

Monte Carlo likelihood by direct simulation

\[\begin{eqnarray} {\mathcal{L}}(\theta) &=& f_{Y_{1:N}}({y^*_{1:N}};\theta) \\ &=& \! \int_{x_{0:N}} \left\{ \prod_{n=1}^{N}\!f_{Y_n|X_n}({y^*_n}| x_n; \theta)\right\} f_{X_{0:N}}(x_{0:N};\theta)\, dx_{0:N}. \tag{L2} \end{eqnarray}\]
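The integral above suggests the simplest Monte Carlo estimator: draw \(J\) state trajectories from \(f_{X_{0:N}}(\,\cdot\,;\theta)\) by simulating the process model, then average the product of measurement densities along each trajectory. A sketch for a toy model (a Gaussian random walk observed with Gaussian error, \(X_0=0\); all values illustrative):

```r
set.seed(5)
N <- 20
x_true <- cumsum(rnorm(N))         # latent random walk
ystar  <- x_true + rnorm(N)        # observations: Y_n | X_n ~ N(X_n, 1)

J <- 10000
lik_hat <- mean(replicate(J, {
  x <- cumsum(rnorm(N))            # a draw from f_{X_{0:N}}
  prod(dnorm(ystar, mean=x, sd=1)) # prod_n f_{Y_n|X_n}(y*_n | x_n)
}))
log(lik_hat)
```

This estimator is unbiased, but the variance of the product grows rapidly with \(N\), since almost all unconditionally simulated trajectories miss the data; this is precisely the problem the particle filter addresses.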

Sequential Monte Carlo: The particle filter

\[\begin{eqnarray} &&f_{X_n|Y_{1:n-1}}(x_n|y^*_{1:n-1}; \theta) \\ &&\quad = \int_{x_{n-1}} \! f_{X_n|X_{n-1}}(x_n|x_{n-1};\theta)\, f_{X_{n-1}|Y_{1:n-1}}(x_{n-1}| y^*_{1:n-1}; \theta) \, dx_{n-1}. \end{eqnarray}\]

\[\begin{eqnarray} &&f_{X_n|Y_{1:n}}(x_n|y^*_{1:n}; \theta) \\ &&\quad = f_{X_n|Y_n,Y_{1:n-1}}(x_n|y^*_n,y^*_{1:n-1}; \theta) \\ &&\quad =\frac{f_{Y_n|X_n}(y^*_{n}|x_{n};\theta)\,f_{X_n|Y_{1:n-1}}(x_{n}|y^*_{1:n-1};\theta)}{\int f_{Y_n|X_n}(y^*_{n}|u_{n};\theta)\,f_{X_n|Y_{1:n-1}}(u_{n}|y^*_{1:n-1};\theta)\, du_n}. \end{eqnarray}\]

  1. Suppose \(X_{n-1,j}^{F}\), \(j=1,\dots,J\) is a set of \(J\) points drawn from the filtering distribution at time \(t_{n-1}\).

  2. We obtain a sample \(X_{n,j}^{P}\) of points drawn from the prediction distribution at time \(t_n\) by simply simulating the process model: \[X_{n,j}^{P} \sim \mathrm{process}(X_{n-1,j}^{F},\theta), \qquad j=1,\dots,J.\]

  3. Having obtained \(X_{n,j}^{P}\), we obtain a sample of points from the filtering distribution at time \(t_n\) by resampling from \(\big\{X_{n,j}^{P},j\in 1:J\big\}\) with weights \[w_{n,j}=f_{Y_n|X_n}(y^*_{n}|X^P_{n,j};\theta).\]

  4. The Monte Carlo principle tells us that the conditional likelihood \[\begin{eqnarray} {\mathcal{L}}_n(\theta) &=& f_{Y_n|Y_{1:n-1}}(y^*_n|y^*_{1:n-1};\theta) \\ &=& \int f_{Y_n|X_n}(y^*_{n}|x_{n};\theta)\,f_{X_n|Y_{1:n-1}}(x_{n}|y^*_{1:n-1};\theta)\, dx_n \end{eqnarray} \] is approximated by \[\hat{\mathcal{L}}_n(\theta) \approx \frac{1}{J}\,\sum_j\, f_{Y_n|X_n}(y^*_{n}|X_{n,j}^{P};\theta),\] since \(X_{n,j}^{P}\) is approximately a draw from \(f_{X_n|Y_{1:n-1}}(x_{n}|y^*_{1:n-1};\theta)\).

  5. We can iterate this procedure through the data, one step at a time, alternately simulating and resampling, until we reach \(n=N\).

  6. The full log likelihood then has approximation \[\begin{eqnarray}{\ell}(\theta) &=& \log{{\mathcal{L}}(\theta)} \\ &=& \sum_n \log{{\mathcal{L}}_n(\theta)} \\ &\approx& \sum_n\log\hat{\mathcal{L}}_n(\theta). \end{eqnarray} \]
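Steps 1–6 can be implemented from scratch in a few lines. Below is a minimal sketch for a toy model (random-walk state, \(Y_n \mid X_n \sim N(X_n,1)\), \(X_0=0\)); because this model is linear and Gaussian, the exact log likelihood is available from the Kalman filter, giving a check on the particle approximation. All names and values here are illustrative and are not part of pomp.

```r
set.seed(6)
N <- 50
x <- cumsum(rnorm(N)); ystar <- x + rnorm(N)    # simulated data

pfilter_toy <- function(ystar, J=5000) {
  xf <- rep(0, J)                               # filtering sample at time 0
  ll <- 0
  for (n in seq_along(ystar)) {
    xp <- xf + rnorm(J)                         # step 2: simulate process model
    w  <- dnorm(ystar[n], mean=xp, sd=1)        # step 3: measurement weights
    ll <- ll + log(mean(w))                     # step 4: conditional likelihood
    xf <- sample(xp, J, replace=TRUE, prob=w)   # step 3: resample
  }
  ll                                            # step 6: sum of log-conditionals
}

## Exact log likelihood via the Kalman filter recursions, for comparison.
kalman_ll <- function(ystar) {
  m <- 0; v <- 0; ll <- 0
  for (yn in ystar) {
    v  <- v + 1                                 # predict: add process variance
    ll <- ll + dnorm(yn, m, sqrt(v + 1), log=TRUE)
    k  <- v/(v + 1)                             # update: Kalman gain
    m  <- m + k*(yn - m); v <- (1 - k)*v
  }
  ll
}

pfilter_toy(ystar)
kalman_ll(ystar)    # the two values should be close
```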

Particle filtering in pomp

Here, we’ll get some practical experience with the particle filter, and the likelihood function, in the context of our influenza-outbreak case study, using the flu pomp object created earlier from the bsflu data.

In pomp, the basic particle filter is implemented in the command pfilter. We must choose the number of particles to use by setting the Np argument.

pf <- pfilter(flu,Np=5000,params=c(Beta=3,mu_I=1/2,mu_R1=1/4,mu_R2=1/1.8,rho=0.9))
logLik(pf)
## [1] -217.0932

We can run a few particle filters to get an estimate of the Monte Carlo variability:

pf <- replicate(10,
  pfilter(flu,Np=5000,params=c(Beta=3,mu_I=1/2,mu_R1=1/4,mu_R2=1/1.8,rho=0.9)))
ll <- sapply(pf,logLik)
logmeanexp(ll,se=TRUE)
##                      se 
## -212.781124    2.885455

The graph of the likelihood function: The likelihood surface

## Build a grid of (Beta,mu_I) values at which to evaluate the likelihood.
## The rep(...,each=3) gives three replicate evaluations at each grid point.
## (The line specifying the Beta values was lost; the range below is illustrative.)
expand.grid(Beta=seq(from=1,to=4,length=40),
  mu_I=rep(seq(from=0.5,to=2,length=40),each=3)) -> p

library(foreach)
library(doParallel)
library(iterators)
registerDoParallel()

foreach (theta=iter(p,"row"),.combine=rbind,
  .packages="pomp"
) %dopar% {
  pfilter(flu,params=unlist(theta),Np=5000) -> pf
  theta$loglik <- logLik(pf)
  theta    # return the parameter row with its log likelihood attached
} -> p