Important Note: These materials have been superseded by materials provided as part of the SISMID short course on Simulation-based Inference.
Licensed under the Creative Commons Attribution-NonCommercial license. Please share and remix noncommercially, mentioning its origin.
This document has its origins in the SISMID short course on Simulation-based Inference given by Aaron King and Edward Ionides.
Produced in R version 3.3.1 using pomp version 1.7.5.1.
Dynamics of infectious disease outbreaks illustrate this well.
See Rohani and King (2010) for a review of some of the high points.
Diagram of the model:
Overdispersed binomial measurement model: $\mathrm{cases}_t \mid \Delta N_{IR} = z_t \sim \mathrm{Normal}\!\left(\rho\,z_t,\;\rho(1-\rho)\,z_t + (\psi\rho z_t)^2\right)$
Entry into susceptible class: $\mu_{BS}(t) = (1-c)\,B(t-\tau) + c\,\delta(t-t_0)\int_{t-1}^{t} B(t-\tau-s)\,ds$
Force of infection: $\mu_{SE}(t) = \frac{\beta(t)}{N(t)}\,(I+\iota)\,\zeta(t)$
$\zeta(t)$ = gamma white noise with intensity $\sigma_{SE}$ (He et al. 2010; Bhadra et al. 2011)
We’ll load the packages we’ll need, and set the random seed, to allow reproducibility. Note that we’ll be making heavy use of the data-munging methods in packages plyr and reshape2, a tutorial introduction to which is provided here. Also, we’ll be using ggplot2 for plotting: see this brief tutorial. Finally, we’ll use the convenient magrittr syntax, which is explained here.
library(pomp)
library(plyr)
library(reshape2)
library(magrittr)
library(ggplot2)
theme_set(theme_bw())
stopifnot(packageVersion("pomp")>="1.4.8")
set.seed(594709947L)
Now we’ll load the data and covariates. The data are measles reports from 20 cities in England and Wales. We also have information on the population sizes and birth-rates in these cities; we’ll treat these variables as covariates.
daturl <- "http://kingaa.github.io/pomp/vignettes/twentycities.rda"
datfile <- file.path(tempdir(),"twentycities.rda")
download.file(daturl,destfile=datfile,mode="wb")
load(datfile)
In the following, we’ll be using the plyr, reshape2, and magrittr packages. These provide a number of tools for data munging that are extremely flexible and useful. A brief tutorial on these tools is provided here.
We select the data for London and pre-process the measles and demography data.
measles %>%
mutate(year=as.integer(format(date,"%Y"))) %>%
subset(town=="London" & year>=1950 & year<1964) %>%
mutate(time=(julian(date,origin=as.Date("1950-01-01")))/365.25+1950) %>%
subset(time>1950 & time<1964, select=c(time,cases)) -> dat
demog %>% subset(town=="London",select=-town) -> demogLondon
We plot the data and covariates.
dat %>% ggplot(aes(x=time,y=cases))+geom_line()
demogLondon %>% melt(id="year") %>%
ggplot(aes(x=year,y=value))+geom_point()+
facet_wrap(~variable,ncol=1,scales="free_y")
Now, we smooth the covariates. Note that we delay the entry of newborns into the susceptible pool.
demogLondon %>%
summarize(
## monthly grid of covariate times
time=seq(from=min(year),to=max(year),by=1/12),
## smoothed interpolation of the population size
pop=predict(smooth.spline(x=year,y=pop),x=time)$y,
## smoothed birthrate, centered at mid-year and lagged by 4 yr
## to delay the entry of newborns into the susceptible pool
birthrate=predict(smooth.spline(x=year+0.5,y=births),x=time-4)$y
) -> covar
plot(pop~time,data=covar,type='l')
points(pop~year,data=demogLondon)
plot(birthrate~time,data=covar,type='l')
points(births~year,data=demogLondon)
plot(birthrate~I(time-4),data=covar,type='l')
points(births~I(year+0.5),data=demogLondon)
We propose a variant of the SEIR model as an explanation for these data. This is a compartmental model that, diagrammatically, looks as follows.
model diagram
B=births
S=susceptibles
E=exposed, incubating
I=infectious
R=recovered
We require a simulator for this model. The following code implements one.
Notable complexities include the cohort effect, whereby a specified fraction (cohort) of each birth cohort enters the susceptible pool all at once, term-time seasonality of the transmission rate, and extrademographic stochasticity in the force of infection.
rproc <- Csnippet("
double beta, br, seas, foi, dw, births;
double rate[6], trans[6];
// cohort effect
if (fabs(t-floor(t)-251.0/365.0) < 0.5*dt)
br = cohort*birthrate/dt + (1-cohort)*birthrate;
else
br = (1.0-cohort)*birthrate;
// term-time seasonality
t = (t-floor(t))*365.25;
if ((t>=7&&t<=100) || (t>=115&&t<=199) || (t>=252&&t<=300) || (t>=308&&t<=356))
seas = 1.0+amplitude*0.2411/0.7589;
else
seas = 1.0-amplitude;
// transmission rate
beta = R0*(gamma+mu)*seas;
// expected force of infection
foi = beta*pow(I+iota,alpha)/pop;
// white noise (extrademographic stochasticity)
dw = rgammawn(sigmaSE,dt);
rate[0] = foi*dw/dt; // stochastic force of infection
rate[1] = mu; // natural S death
rate[2] = sigma; // rate of ending of latent stage
rate[3] = mu; // natural E death
rate[4] = gamma; // recovery
rate[5] = mu; // natural I death
// Poisson births
births = rpois(br*dt);
// transitions between classes
reulermultinom(2,S,&rate[0],dt,&trans[0]);
reulermultinom(2,E,&rate[2],dt,&trans[2]);
reulermultinom(2,I,&rate[4],dt,&trans[4]);
S += births - trans[0] - trans[1];
E += trans[0] - trans[2] - trans[3];
I += trans[2] - trans[4] - trans[5];
R = pop - S - E - I;
W += (dw - dt)/sigmaSE; // standardized i.i.d. white noise
C += trans[4]; // true incidence
")
In the above, C accumulates the number of individuals progressing from the I to the R compartment over a reporting interval. Since recognized measles infections are quarantined, most transmission occurs before case recognition, so that recognition coincides roughly with the end of the infectious period; we therefore treat this I-to-R flow as the true incidence, i.e., the number of new cases arising over the interval.
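The extrademographic noise increment dw in the simulator above is drawn with pomp's rgammawn, which returns a gamma-distributed increment with mean dt and variance sigmaSE^2 dt, so that the multiplicative noise dw/dt has mean 1. pomp also provides rgammawn at the R level, which we can use to check these properties; the value of sigmaSE below is arbitrary.
sigmaSE <- 0.08                                  # arbitrary illustrative intensity
dt <- 1/365.25
dw <- rgammawn(n=100000,sigma=sigmaSE,dt=dt)
c(mean=mean(dw),expected.mean=dt,var=var(dw),expected.var=sigmaSE^2*dt)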
We complete the process model definition by specifying the distribution of initial unobserved states. The following codes parameterize the initial state in terms of the fractions of the population in each of the four compartments, supplied as the parameters S_0, E_0, I_0, R_0.
initlz <- Csnippet("
double m = pop/(S_0+E_0+I_0+R_0);
S = nearbyint(m*S_0);
E = nearbyint(m*E_0);
I = nearbyint(m*I_0);
R = nearbyint(m*R_0);
W = 0;
C = 0;
")
We’ll model both under-reporting and measurement error. We want E[cases|C]=ρC, where C is the true incidence and 0<ρ<1 is the reporting efficiency. We’ll also assume that Var[cases|C]=ρ(1−ρ)C+(ψρC)^2, where ψ quantifies overdispersion. Note that when ψ=0, the variance-mean relation is that of the binomial distribution. To be specific, we’ll choose cases|C∼f(⋅|ρ,ψ,C), where f(c|ρ,ψ,C)=Φ(c+1/2,ρC,ρ(1−ρ)C+(ψρC)^2)−Φ(c−1/2,ρC,ρ(1−ρ)C+(ψρC)^2), where Φ(x,μ,σ^2) is the c.d.f. of the normal distribution with mean μ and variance σ^2.
The following computes P[cases|C].
dmeas <- Csnippet("
double m = rho*C;
double v = m*(1.0-rho+psi*psi*m);
double tol = 1.0e-18;
if (cases > 0.0) {
lik = pnorm(cases+0.5,m,sqrt(v)+tol,1,0)-pnorm(cases-0.5,m,sqrt(v)+tol,1,0)+tol;
} else {
lik = pnorm(cases+0.5,m,sqrt(v)+tol,1,0)+tol;
}
")
The following codes simulate cases|C.
rmeas <- Csnippet("
double m = rho*C;
double v = m*(1.0-rho+psi*psi*m);
double tol = 1.0e-18;
cases = rnorm(m,sqrt(v)+tol);
if (cases > 0.0) {
cases = nearbyint(cases);
} else {
cases = 0.0;
}
")
We put all the model components together with the data in a call to pomp:
dat %>%
pomp(t0=with(dat,2*time[1]-time[2]),
time="time",
rprocess=euler.sim(rproc,delta.t=1/365.25),
initializer=initlz,
dmeasure=dmeas,
rmeasure=rmeas,
covar=covar,
tcovar="time",
zeronames=c("C","W"),
statenames=c("S","E","I","R","C","W"),
paramnames=c("R0","mu","sigma","gamma","alpha","iota",
"rho","sigmaSE","psi","cohort","amplitude",
"S_0","E_0","I_0","R_0")
) -> m1
The following codes plot the data and covariates together.
m1 %>% as.data.frame() %>%
melt(id="time") %>%
ggplot(aes(x=time,y=value))+
geom_line()+
facet_grid(variable~.,scales="free_y")
He et al. (2010) estimated the parameters of this model. The full set is included in the R code accompanying this document, where they are read into a data frame called mles.
mles %>% subset(town=="London") -> mle
paramnames <- c("R0","mu","sigma","gamma","alpha","iota",
"rho","sigmaSE","psi","cohort","amplitude",
"S_0","E_0","I_0","R_0")
mle %>% extract(paramnames) %>% unlist() -> theta
mle %>% subset(select=-c(S_0,E_0,I_0,R_0)) %>%
knitr::kable(row.names=FALSE)
town | loglik | loglik.sd | mu | delay | sigma | gamma | rho | R0 | amplitude | alpha | iota | cohort | psi | sigmaSE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
London | -3804.9 | 0.16 | 0.02 | 4 | 28.9 | 30.4 | 0.488 | 56.8 | 0.554 | 0.976 | 2.9 | 0.557 | 0.116 | 0.0878 |
Verify that we get the same likelihood as He et al. (2010).
library(foreach)
library(doParallel)
registerDoParallel()
set.seed(998468235L,kind="L'Ecuyer")
foreach(i=1:4,
.packages="pomp",
.options.multicore=list(set.seed=TRUE)
) %dopar% {
pfilter(m1,Np=10000,params=theta)
} -> pfs
logmeanexp(sapply(pfs,logLik),se=TRUE)
## se
## -3800.9542945 0.3404236
Simulations at the MLE:
m1 %>%
simulate(params=theta,nsim=9,as.data.frame=TRUE,include.data=TRUE) %>%
ggplot(aes(x=time,y=cases,group=sim,color=(sim=="data")))+
guides(color=FALSE)+
geom_line()+facet_wrap(~sim,ncol=2)
The parameters are constrained to be positive, and some of them are constrained to lie between 0 and 1. We can turn the likelihood maximization problem into an unconstrained maximization problem by transforming the parameters. The following Csnippets implement such a transformation and its inverse.
toEst <- Csnippet("
Tmu = log(mu);
Tsigma = log(sigma);
Tgamma = log(gamma);
Talpha = log(alpha);
Tiota = log(iota);
Trho = logit(rho);
Tcohort = logit(cohort);
Tamplitude = logit(amplitude);
TsigmaSE = log(sigmaSE);
Tpsi = log(psi);
TR0 = log(R0);
to_log_barycentric (&TS_0, &S_0, 4);
")
fromEst <- Csnippet("
Tmu = exp(mu);
Tsigma = exp(sigma);
Tgamma = exp(gamma);
Talpha = exp(alpha);
Tiota = exp(iota);
Trho = expit(rho);
Tcohort = expit(cohort);
Tamplitude = expit(amplitude);
TsigmaSE = exp(sigmaSE);
Tpsi = exp(psi);
TR0 = exp(R0);
from_log_barycentric (&TS_0, &S_0, 4);
")
pomp(m1,toEstimationScale=toEst,
fromEstimationScale=fromEst,
statenames=c("S","E","I","R","C","W"),
paramnames=c("R0","mu","sigma","gamma","alpha","iota",
"rho","sigmaSE","psi","cohort","amplitude",
"S_0","E_0","I_0","R_0")) -> m1
The linked document shows how a likelihood profile can be constructed using IF2.
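For concreteness, here is a minimal sketch of how such a profile might be set up with mif2, profiling over the cohort parameter. This is not the linked document's code; the grid, Np, Nmif, cooling fraction, and random-walk intensities are placeholders that would need tuning.
foreach (chrt=seq(from=0,to=1,length=11),.combine=rbind,
.packages="pomp",.options.multicore=list(set.seed=TRUE)) %dopar% {
start <- theta; start["cohort"] <- chrt          # fix cohort at the profile point
mf <- mif2(m1,start=start,Np=2000,Nmif=50,transform=TRUE,
cooling.type="geometric",cooling.fraction.50=0.1,
rw.sd=rw.sd(R0=0.02,sigmaSE=0.02,psi=0.02,rho=0.02,amplitude=0.02,
S_0=ivp(0.02),E_0=ivp(0.02),I_0=ivp(0.02),R_0=ivp(0.02)))
ll <- logmeanexp(replicate(5,logLik(pfilter(mf,Np=2000))),se=TRUE)
data.frame(cohort=chrt,loglik=ll[1],loglik.se=ll[2])
} -> prof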
Force of infection: $\mu_{SE}(t) = \frac{\beta(t)}{N(t)}\,(I+\iota)\,\zeta(t)$
Profile likelihood for birth-cohort delay, showing 95% and 99% critical values of the log likelihood.
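The 95% and 99% critical values follow from Wilks' approximation: parameter values whose profile log likelihood lies within one-half the corresponding quantile of a chi-squared distribution with 1 degree of freedom of the maximum are retained. Assuming the profile results are collected in a data frame prof with a loglik column (as in the sketch above), the cutoffs are computed as follows.
maxll <- max(prof$loglik)
maxll-0.5*qchisq(p=c(0.95,0.99),df=1)            # 95% and 99% critical values of the profile log likelihood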
Parameter estimates for the twenty towns, together with 1950 population sizes (LP = latent period, IP = infectious period):
town | pop | R0 | amplitude | LP | IP | alpha | iota | rho | psi | sigmaSE |
---|---|---|---|---|---|---|---|---|---|---|
Halesworth | 2170 | 33.100 | 0.381 | 7.870 | 2.290 | 0.948 | 0.00912 | 0.754 | 0.641 | 0.0748 |
Lees | 4250 | 29.700 | 0.153 | 8.510 | 2.050 | 0.968 | 0.03110 | 0.612 | 0.681 | 0.0802 |
Mold | 6410 | 21.400 | 0.271 | 5.930 | 1.780 | 1.040 | 0.01450 | 0.131 | 2.870 | 0.0544 |
Dalton.in.Furness | 10600 | 28.300 | 0.203 | 5.480 | 1.980 | 0.989 | 0.03860 | 0.455 | 0.818 | 0.0779 |
Oswestry | 11000 | 52.900 | 0.339 | 10.300 | 2.710 | 1.040 | 0.02980 | 0.631 | 0.476 | 0.0699 |
Northwich | 18300 | 30.100 | 0.423 | 8.510 | 3.020 | 0.948 | 0.06020 | 0.795 | 0.402 | 0.0857 |
Bedwellty | 28900 | 24.700 | 0.160 | 6.820 | 3.030 | 0.937 | 0.03960 | 0.311 | 0.951 | 0.0611 |
Consett | 39100 | 35.900 | 0.200 | 9.080 | 2.660 | 1.010 | 0.07310 | 0.650 | 0.406 | 0.0712 |
Hastings | 65700 | 34.200 | 0.299 | 7.000 | 5.440 | 1.000 | 0.18600 | 0.695 | 0.396 | 0.0955 |
Cardiff | 245000 | 34.400 | 0.223 | 9.870 | 3.090 | 0.996 | 0.14100 | 0.602 | 0.270 | 0.0539 |
Bradford | 294000 | 32.100 | 0.236 | 8.510 | 3.360 | 0.991 | 0.24400 | 0.599 | 0.190 | 0.0451 |
Hull | 302000 | 38.900 | 0.221 | 9.180 | 5.460 | 0.968 | 0.14200 | 0.582 | 0.256 | 0.0636 |
Nottingham | 307000 | 22.600 | 0.157 | 5.720 | 3.700 | 0.982 | 0.17000 | 0.609 | 0.258 | 0.0380 |
Bristol | 443000 | 26.800 | 0.203 | 6.190 | 4.940 | 1.010 | 0.44100 | 0.626 | 0.201 | 0.0392 |
Leeds | 510000 | 47.800 | 0.267 | 9.480 | 10.900 | 1.000 | 1.25000 | 0.666 | 0.167 | 0.0778 |
Sheffield | 515000 | 33.100 | 0.313 | 7.230 | 6.380 | 1.020 | 0.85300 | 0.649 | 0.175 | 0.0428 |
Manchester | 704000 | 32.900 | 0.290 | 11.100 | 6.940 | 0.965 | 0.59000 | 0.550 | 0.161 | 0.0551 |
Liverpool | 802000 | 48.100 | 0.305 | 7.900 | 9.800 | 0.978 | 0.26300 | 0.494 | 0.136 | 0.0533 |
Birmingham | 1120000 | 43.400 | 0.428 | 8.510 | 11.600 | 1.010 | 0.34300 | 0.544 | 0.178 | 0.0611 |
London | 3390000 | 56.800 | 0.554 | 13.100 | 12.500 | 0.976 | 2.90000 | 0.488 | 0.116 | 0.0878 |
r | NA | 0.455 | 0.301 | 0.322 | 0.946 | 0.106 | 0.93200 | -0.203 | -0.931 | -0.3260 |
Here $r = \mathrm{cor}(\log\hat\theta,\,\log N_{1950})$, the correlation of the log parameter estimates with log 1950 population.
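These correlations can be reproduced directly; assuming the table above is held in a data frame tab with one row per town (tab is a hypothetical name; its columns can be assembled from mles and the demography data), one can compute:
est <- c("R0","amplitude","LP","IP","alpha","iota","rho","psi","sigmaSE")
sapply(tab[est],function (x) cor(log(x),log(tab$pop)))   # correlation of log estimates with log 1950 population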
Simulations at the MLE:
m1 %>%
simulate(params=theta,nsim=100,as.data.frame=TRUE,include.data=TRUE) %>%
subset(select=c(time,sim,cases)) %>%
mutate(data=sim=="data") %>%
ddply(~time+data,summarize,
p=c(0.05,0.5,0.95),q=quantile(cases,prob=p,names=FALSE)) %>%
mutate(p=mapvalues(p,from=c(0.05,0.5,0.95),to=c("lo","med","hi")),
data=mapvalues(data,from=c(TRUE,FALSE),to=c("data","simulation"))) %>%
dcast(time+data~p,value.var='q') %>%
ggplot(aes(x=time,y=med,color=data,fill=data,ymin=lo,ymax=hi))+
geom_ribbon(alpha=0.2)
Modify the He et al. (2010) model to remove the cohort effect. Run simulations and compute likelihoods to convince yourself that the resulting codes agree with the original ones for cohort = 0.
Now modify the transmission seasonality to use a sinusoidal form. How many parameters must you use? Fixing the other parameters at their MLE values, compute and visualize a profile likelihood over these parameters.
Set the extrademographic stochasticity parameter σSE=0, set α=1, and fix ρ and ι at their MLE values, then maximize the likelihood over the remaining parameters. How do your results compare with those at the MLE? Compare likelihoods but also use simulations to diagnose differences between the models.
Bhadra, A., E. L. Ionides, K. Laneri, M. Pascual, M. Bouma, and R. Dhiman. 2011. Malaria in Northwest India: Data analysis via partially observed stochastic differential equation models driven by Lévy noise. Journal of the American Statistical Association 106:440–451.
He, D., E. L. Ionides, and A. A. King. 2010. Plug-and-play inference for disease dynamics: Measles in large and small populations as a case study. Journal of the Royal Society, Interface 7:271–283.
Rohani, P., and A. A. King. 2010. Never mind the length, feel the quality: The impact of long-term epidemiological data sets on theory, application and policy. Trends in Ecology & Evolution 25:611–618.