\documentclass{article} \usepackage{sober} \usepackage{hyperref} \usepackage{natbib} \bibliographystyle{ESA1009} \title{Likelihood, Bayes, and all that, in an epidemiological context} \author{Ben Bolker} \date{\today} \newcommand{\code}[1]{{\tt #1}} \begin{document} \maketitle \includegraphics[width=2.64cm,height=0.93cm]{cc-attrib-nc.png} \begin{minipage}[b]{3in} {\tiny Licensed under the Creative Commons attribution-noncommercial license (\url{http://creativecommons.org/licenses/by-nc/3.0/}). Please share \& remix noncommercially, mentioning its origin.} \end{minipage} \SweaveOpts{height=5,width=10} \setkeys{Gin}{width=\textwidth} <>= x11(width=10,height=6) ## hack for figure margins too small error par(bty="l",las=1) options(continue=" ") @ <>= options(SweaveHooks=list( fig=function() par(mfrow=c(1,2)) )) @ <>= options(SweaveHooks=list( fig=function() par(mfrow=c(2,2)) )) @ <>= options(SweaveHooks=list( fig=function() par(mar=c(5,13,4,11)+0.1)) ) @ \section{Likelihood} \begin{itemize} \item{sums of squares are fine, but may want a goodness-of-fit metric that is tied to a particular model (e.g. chain binomial) of how an epidemic works} \item{\emph{likelihood}: \textbf{probability of observed data occurring given a particular model (= set of parameters)}: write this as $\mbox{Prob}(\mbox{data}|\mbox{model})$ (also sometimes stated as ``probability given a hypothesis'')} \item{Niamey example: 27 cases in week 2, commune 1. \emph{Suppose} that there are 8000 susceptibles in the population. If the attack rate (\emph{per capita} infection probability) is 0.002, what is the likelihood (probability of 27 cases)? \emph{Answer}: \Sexpr{round(dbinom(27,prob=0.002,size=8000),4)} (how did I get this answer (mathematically? in R?))} \item{the \emph{maximum likelihood estimate} (MLE) is the value of the parameter that makes the data most likely to have occurred. In the particular case of binomial data, we could do a little bit of calculus to show that the MLE in this case is (number of cases)/(number of susceptibles), which is common sense} \item{the \emph{likelihood curve} shows the likelihood for a range of possible parameter values. We often draw the \emph{log-likelihood curve} (or the \emph{negative log-likelihood curve}) instead, for convenience/historical reasons. <>= <> op <- par(las=1) curve(dbinom(27,prob=x,size=8000), from=0,to=0.008, xlab="Attack rate",ylab="Likelihood") curve(dbinom(27,prob=x,size=8000,log=TRUE), from=0,to=0.008, xlab="Attack rate",ylab="Log-likelihood") par(op) @ } \item{Likelihood \emph{surfaces}: more than one parameter (in this case number of susceptibles and force of infection) <>= <> library(emdbook) op <- par(las=1) brvec <- c(-30,seq(-20,-2,by=2)) z <- curve3d(dbinom(27,prob=x,size=y,log=TRUE), from=c(0.002,5000),to=c(0.006,12000), sys3d="image",xlab="attack rate",ylab="") ## breaks=brvec, ## color=heat.colors(length(brvec)-1)) mtext(side=2,line=4,"# susceptibles") points(27/8000,8000,pch=16) @ } \item{intuition: the shape of the likelihood surface (particularly its steepness) tells us about the uncertainty in our parameters. 
There are a lot of details here that we won't go into (much), because they are
handled in a slightly different way by Bayesians}
\item{we could translate this problem into a question about the contact rate
$\beta$ rather than the attack rate by saying $p= 1-\exp(-\beta I(t-1)/N)$ ---
in this case our estimate of $\hat p=\Sexpr{27/8000}$ would translate to
$\hat \beta=-\log(1-\hat p) N/I(t-1) \approx \hat p N/I(t-1) =
\hat p \times (300,000)/22 = \Sexpr{round(27/8000*3e5/22,2)}$.}
\item{to solve non-trivial problems (with more than a single data point) we
usually assume that the data points are all \emph{independent}, in which case
the overall likelihood is the product of the individual likelihoods, or the
overall log-likelihood is the sum of the log-likelihoods}
\item{it also turns out that for the special case of a normal distribution,
the MLE is equivalent to least-squares fitting. Suppose we have data $y_i$ and
are fitting a function to compute the expected values $\mu_i$
(e.g. $\mu_i=a+b x_i$). The normal probability density is
$C \cdot \exp(-(y_i-\mu_i)^2/(2 \sigma^2))$, where $C$ is some ugly stuff
($1/(\sqrt{2 \pi} \sigma)$). The logarithm is
$\log(C) - (y_i-\mu_i)^2/(2 \sigma^2)$. If we sum up the negative
log-likelihoods (so we will want to minimize rather than maximize) we get
$-N \log(C) + 1/(2 \sigma^2) \sum (y_i-\mu_i)^2$. If all we care about is
minimizing, we can ignore the first term and the multiplier in the second term
--- we just have to minimize $\sum (y_i-\mu_i)^2$, which is just the sum of
squared errors. (We often ignore \emph{normalization constants} such as
$1/\sqrt{2 \pi}$, which don't depend on the parameters \ldots)}
\end{itemize}

\subsection{Example}

We can read the data straight from the web, if we know where it is:
<>=
daturl <- "http://www.umich.edu/~kingaa/EEID/Ecology/commune.measles.csv"
niamey <- read.csv(url(daturl), header=FALSE)
@
Or we can read it from the working directory:
<<>>=
niamey <- read.csv("commune.measles.csv")
cases <- niamey[,1]
@
Suppose that we know the starting number of susceptibles in commune 1; we can
then reconstruct the number of susceptibles at the \emph{previous} period for
each observation:
<<>>=
S0 <- 8000
n <- length(cases)
cumcases <- cumsum(cases)
Susc <- S0-c(0,cumcases[1:(n-1)]) ## figure out S(t-1)
@
For a given value of $\beta$, the force of infection is $\beta I/N$ and the
attack rate is $1-\exp(-\beta (I/N) \, \Delta t) = 1-\exp(-\beta I/N)$
(because we have biweekly data, $\Delta t=1$). With a particular value, say
$\beta=50$, we can compute the probability of 27 cases given 8000 susceptibles
and 22 cases in the previous generation:
<<>>=
popsize <- 3e5
beta <- 50
dbinom(cases[2],size=Susc[1],
       prob=1-exp(-beta*cases[1]/popsize))
@
Or to calculate the log-likelihood:
<<>>=
dbinom(cases[2],size=Susc[1],
       prob=1-exp(-beta*cases[1]/popsize),log=TRUE)
@
There are a few ways we could figure out the overall likelihood, or
log-likelihood, for a given value of $\beta$. Read them over and try
\textbf{one} of them.
\begin{enumerate}
\item{Define a starting value for {\tt likelihood} of 1. Then use a {\tt for}
loop stepping from 2 to {\tt n}. At each step, compute the likelihood
{\tt curr\_lik} using the current cases, previous cases, and previous
susceptibles as shown above, and update the likelihood:
{\tt likelihood <- likelihood * curr\_lik}}
\item{Do the same for log-likelihoods, starting from 0 rather than 1 and
adding the log-likelihood instead of multiplying likelihoods (a sketch of this
approach appears after the list).}
\item{Set up a vector of cases from time 2 to $n$ by dropping the first case:
{\tt cases[-1]} (or {\tt cases[2:n]}). Then set up the vector of cases from
time 1 to $n-1$ ({\tt cases[1:(n-1)]} or {\tt cases[-n]}) and the
corresponding vector of susceptibles. Now you can compute
$1-\exp(-\beta I(t-1)/N)$ (for a given value of $\beta$) in a single,
vectorized statement, and you can feed this expected-attack-rate vector and
the matching numbers of susceptibles and cases into \code{dbinom} to get a
vector of likelihoods; \code{prod} takes the product of a vector}
\item{As in the previous item, but with log-likelihoods (add the
{\tt log=TRUE} argument to the \code{dbinom} call) and use \code{sum} instead
of \code{prod}}
\end{enumerate}
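For reference, here is a minimal sketch of the second (log-likelihood loop)
approach; it is only one of several equally good solutions, and it assumes
that {\tt cases}, {\tt Susc}, {\tt n}, and {\tt popsize} have been defined as
above:
<<>>=
beta <- 50    ## the single value of beta we are evaluating
loglik <- 0   ## log-likelihood of an empty data set
for (t in 2:n) {
  p <- 1-exp(-beta*cases[t-1]/popsize)   ## expected attack rate at time t
  loglik <- loglik + dbinom(cases[t],size=Susc[t-1],prob=p,log=TRUE)
}
loglik
@
The vectorized versions (items 3 and 4) do the same computation with less
code.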
If you have time: write another \code{for} loop, over different values of
$\beta$ (try values ranging from 5 to 200 in steps of 5), to compute the
log-likelihood curve for $\beta$. The result should look like this:
<>=
betavec <- seq(5,200,by=5)
llikvec <- sapply(betavec,
                  function(b) {
                    sum(dbinom(cases[-1],
                               size=Susc[-n],
                               prob=1-exp(-b*cases[-n]/popsize),
                               log=TRUE))
                  })
plot(betavec,llikvec,xlab=expression(beta),las=1,ylab="")
mtext(side=2,line=4,"log-likelihood")
@

\newpage

\section{Bayes}

\subsection{Bayes' rule}

\begin{itemize}
\item{\emph{Bayes' rule}: just a rule about how to figure out
$\mbox{Prob}(H|D)$ from $\mbox{Prob}(D|H)$ (fairly easy to derive):
\begin{equation}
P(H | D) = \frac{P(D|H) P(H)}{P(D)}
\end{equation}
or
\begin{equation}
P(H_i | D) = \frac{P(D | H_i) P(H_i)}{\sum_j P(H_j) P(D|H_j)}
\end{equation}
<>=
r <- 0
d <- acos(r)
scale <- c(0.5,0.3)
npoints <- 100
centre <- c(0.5,0.5)
a <- seq(0, 2 * pi, len = npoints)
m <- matrix(c(scale[1] * cos(a + d/2) + centre[1],
              scale[2] * cos(a - d/2) + centre[2]), npoints, 2)
e <- 0.05
hyp_pts <- matrix(c(0.37,1.04,
                    1+e,0.8+e,
                    1,-e,
                    0.4,-e,
                    -e,0.25),
                  byrow=TRUE,ncol=2)
lab.pts <- matrix(c(0.091,0.255,0.597,0.557,
                    0.869,0.709,0.549,0.511,
                    0.170,0.22, ##y
                    0.865,0.613,
                    0.932,0.698,0.191,0.477,
                    0.087,0.277,0.077,0.31),
                  ncol=2)
##hyp_pts <- hyp_pts[c(5,1:4),]
## lab.pts <- lab.pts[c(5,1:4),]
## par(mar=c(0.2,0.2,0.2,0.2))
plot(1,1,type="n",xlim=c((-e),1+e),ylim=c(-e,1+e),ann=FALSE,
     xlab="",ylab="",axes=FALSE,xaxs="i",yaxs="i")
box()
polygon(m,col="lightgray",lwd=2)
polygon(c(-e,0.5,0.4,-e),c(0.25,0.5,-e,-e),density=8,angle=0,
        col="darkgray")
lines(m,lwd=2)
segments(rep(0.5,nrow(hyp_pts)),rep(0.5,nrow(hyp_pts)),
         hyp_pts[,1],hyp_pts[,2])
##text(lab.pts[,1],lab.pts[,2],1:10)
for(i in 1:5) {
  r <- 2*i-1
  r2 <- 2*i
  text(lab.pts[r,1],lab.pts[r,2],
       substitute(H[x], list(x=i)),adj=0,cex=2)
  text(lab.pts[r2,1],lab.pts[r2,2],
       substitute(D*intersect("","","")*H[x], list(x=i)),adj=0,cex=2)
}
@
}
\item{(false positive/medical testing/forensic example, if time permits; a
short numerical version appears below)}
\end{itemize}
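The sort of calculation that the last item refers to takes only a couple of
lines of R. Here is a hedged sketch; the prevalence, sensitivity, and
specificity below are invented for illustration, not taken from these notes:
<<>>=
prev <- 0.01  ## P(disease): 1% of the population is infected
sens <- 0.99  ## P(positive test | disease)
spec <- 0.95  ## P(negative test | no disease)
## denominator of Bayes' rule: total probability of a positive test
p_pos <- sens*prev + (1-spec)*(1-prev)
## posterior probability of disease given a positive test (about 0.17)
sens*prev/p_pos
@
Even with an accurate test, the posterior probability of disease is modest
because the prior probability (the prevalence) is low.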
\subsection{Bayesian inference}

\begin{itemize}
\item{interpret $P(H|D)$ as \emph{the probability of the ``hypothesis'' given
the data}, where ``hypothesis'' can mean (as in the discussion of likelihood
above) a model, or, for $P(H_i|D)$, a particular value $H_i$ of the
parameters; this is called the \textbf{posterior probability}, and its
distribution is the \textbf{posterior distribution}.}
\item{then $P(D|H)$ is the likelihood \ldots}
\item{but what the hell is $P(H_i)$? The \textbf{prior}.}
\item{the denominator, $\sum_j P(H_j) P(D|H_j)$ (or, for continuously
distributed parameters, $\int P(p) P(D|p) \, dp$), is $P(D)$, the probability
of having gotten the data \emph{somehow}}
\item{if all of the $P(H_i)$ are the same then they all cancel out of the
Bayes' rule formula (and we get $P(H_i|D) = P(D|H_i)/\sum_j P(D|H_j)$,
sometimes called the \emph{scaled likelihood}) --- then the shape of the
posterior distribution is the same as the shape of the likelihood curve (and
the maximum (\emph{mode}) of the posterior distribution is the same as the
MLE)}
\end{itemize}

Bayesians usually summarize the results of an analysis via the
\emph{posterior means} of the parameters (sometimes the \emph{posterior
modes}) and the \emph{quantiles} or \emph{credible intervals} of the
\emph{marginal distributions} (give details if time). Computing the marginal
posterior distributions is a big pain (integrals!) but can be avoided by using
the techniques Aaron will discuss.

\subsubsection{Pros and cons}

Three reasons to be Bayesian:
\begin{enumerate}
\item{\emph{philosophical}: using a Bayesian framework gives you the ability
to make inferences about what you \emph{really} (arguably) want to know, the
probability of a given parameter or model, rather than jumping through the
semantic hoops required by frequentist inference}
\item{\emph{using prior information}: in sparse-data cases (conservation
biology, wildlife disease) we can introduce extra information that we already
know, in a very natural way, through the prior distribution
(\cite{McCarthy2007} shows lots of nice examples)}
\item{\emph{pragmatic}: a lot of problems that are very hard to solve in
classical ways become easier in a Bayesian framework. For example: ``mixed''
models (involving random effects); models with process and measurement error,
or unobserved states; etc.}
\end{enumerate}

Three reasons \emph{not} to be Bayesian:
\begin{enumerate}
\item{\emph{philosophical}: the Bayesian framework requires you to specify
your prior distribution, which most of the time is a subjective statement of
your personal belief about what the parameter might be. Some researchers hate
this subjectivity \citep{Dennis1996}. You can try to make the priors
\emph{weak}, or \emph{uninformative} (\emph{flat} is another synonym), but
this is hard for various reasons.}
\item{\emph{pragmatic (``too hard'')}: using Bayesian approaches requires more
technical overhead than classical approaches (partly, but only partly, because
most canned statistical software is written to apply classical approaches).
Very hard problems are easier, but moderately hard problems are harder \ldots
the main problem is computing integrals (Aaron's lecture will talk about how
to \emph{avoid} doing the integrals)}
\item{\emph{pragmatic (``too easy'')}: in practice, one can almost always get
an answer to a problem (even very hard ones) using modern Bayesian methods,
but sometimes the answers make no sense --- where classical methods would just
fail (!!)}
\end{enumerate}

Many statistical practitioners switch back and forth among approaches
depending on what works best in a particular case. Fewer and fewer \emph{real}
statisticians are rabid about the distinction, although some have strong
preferences \ldots for an amusing recent entry into the debate, see
\cite{gelman_objections_2008}.

\subsubsection{Priors and conjugate priors}

For a single-parameter problem, we can get the general shape of the posterior
(ignoring the denominator) easily, just by multiplying the prior by the
likelihood over a range of parameter values.
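For example, here is a rough numerical version of that idea for the
single-observation attack-rate example from the likelihood section (27 cases
out of 8000 susceptibles), using a flat prior; the grid range and spacing are
arbitrary choices, not part of the original example:
<<>>=
pvec <- seq(1e-4,0.01,by=1e-4)          ## grid of candidate attack rates
prior <- rep(1,length(pvec))            ## flat prior (any constant works)
lik <- dbinom(27,size=8000,prob=pvec)   ## likelihood at each grid value
post <- prior*lik                       ## numerator of Bayes' rule
post <- post/sum(post)                  ## normalize (plays the role of dividing by P(D))
sum(pvec*post)                          ## approximate posterior mean
pvec[which.max(post)]                   ## posterior mode (= MLE with a flat prior)
range(pvec[cumsum(post)>0.025 & cumsum(post)<0.975]) ## crude 95% credible interval
@
With a non-flat prior we would simply replace {\tt prior} with, say,
{\tt dbeta(pvec,a,b)} evaluated on the same grid, which is essentially what
the plots below do.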
Let's go back to the attack rate problem we started with (for 27 cases out of
10,000 susceptibles, what can we say about the attack rate?). Say we use a
flat prior, where the prior probability of any attack rate is equal: then (as
stated above) the prior terms cancel out, and the posterior is equal to the
scaled likelihood. Let's try varying the prior. The location of the peak tells
us something about our estimate; the width of the peak tells us something
about our certainty.
<>=
tmpf <- function(a,b,N=27,minp=1e-4,maxp=0.01,dp=1e-4,
                 size=1e4,scaled_prior=TRUE,
                 scaled_lik=TRUE,main="") {
  pvec = seq(minp,maxp,by=dp)
  likvec = dbinom(N,size=size,prob=pvec)
  prior = dbeta(pvec,a,b) ## 2*(1-pvec)
  ## plot(pvec,prior,type="l",main="prior")
  post = prior*likvec
  ylab <- "posterior"
  if (scaled_prior) {
    ylab <- paste(ylab,"(unscaled)")
    post <- post/(sum(prior*likvec)*dp)
  }
  if (scaled_lik) {
    likvec <- likvec/(sum(likvec)*dp)
  }
  matplot(pvec,cbind(likvec,prior,post),
          type="l",ylab="probability/likelihood",
          col=c(1,2,4),lty=1,main=main)
  legend("right",
         col=c(1,2,4),lty=1,
         c("likelihood","prior","posterior"))
}
@
\SweaveOpts{height=8,width=8}
<>=
<>
tmpf(1,1,main="flat prior")
tmpf(0.001,1000,maxp=0.01,main=expression(prior %->% 0))
tmpf(100,8000,maxp=0.02,main=expression(prior %->% 0.1))
tmpf(27,8000,maxp=0.02,main="prior agrees with data")
@
\SweaveOpts{height=5,width=10}

\cite{Crome+1996} give a nice example of contrasting the effects of different
priors in a conservation context (effects of logging on bird communities).

\subsection{Conjugate priors}

\emph{Conjugate priors} are special families of priors chosen so that the
combination of the prior and the likelihood (i.e. the outcome of Bayes' rule)
ends up having the same distributional form as the prior, just with updated
parameters. For example, if you start with a normally distributed prior for
the mean, and your data are normal, then the posterior distribution of the
mean is also normal --- but with different (updated) parameters, in a
sensible way:
\begin{equation}
\mu_{\mbox{\tiny post}} =
(\mu_{\mbox{\tiny dat}}/\sigma^2_{\mbox{\tiny dat}} +
 \mu_{\mbox{\tiny prior}}/\sigma^2_{\mbox{\tiny prior}})/
(1/\sigma^2_{\mbox{\tiny dat}} + 1/\sigma^2_{\mbox{\tiny prior}})
\end{equation}

Some standard conjugate pairs:

\begin{tabular}{llp{4in}}
\textbf{Data} & \textbf{Conjugate prior} & \textbf{Interpretation / updating rule} \\
Binomial ($p$) & Beta & $a \to a+k$ ($k$=\# successes), $b \to b+(N-k)$
($N-k$=\# failures) \\
Poisson ($\lambda$) & Gamma & mean (shape $\cdot$ scale) = prior mean
intensity, shape = \# counts \\
Normal ($\mu$) & Normal & Mean and standard deviation of the prior
observations \\
Normal ($\sigma^2$) & Inverse-gamma &
\end{tabular}

Also see
\url{http://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions}

Conjugate priors are useful for illustrating/understanding the effects of
priors (see above); they are most useful as \emph{components} of more
complicated Bayesian solutions.
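As a quick numerical check of the normal-normal updating rule above (the prior
and data summaries here are invented purely for illustration):
<<>>=
mu_prior <- 10; s2_prior <- 4^2   ## made-up prior mean and variance
mu_dat <- 14;   s2_dat <- 1^2     ## made-up data mean and variance
## precision-weighted average: the posterior mean is pulled toward whichever
## of the prior and the data has the smaller variance
(mu_dat/s2_dat + mu_prior/s2_prior)/(1/s2_dat + 1/s2_prior)
@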
\subsubsection{Example}

Play around with conjugate priors for the binomial distribution (Beta). Use
the \code{curve} function: for example, consider binomial data with 5
successes and 10 failures (total sample $N=15$, \# successes $k=5$, failures
$N-k=10$) and a prior distribution of 10 successes and 5 failures ($a=10$,
$b=5$). The posterior is then $a=15$, $b=15$. (R uses \code{shape1} and
\code{shape2} to denote these parameters.)
<>=
curve(dbeta(x,shape1=10,shape2=5),col=1,from=0,to=1,
      ylim=c(0,5))  ## prior (a=10, b=5)
curve(dbinom(5,prob=x,size=15),col=2,add=TRUE)  ## likelihood
curve(dbeta(x,shape1=15,shape2=15),col=4,add=TRUE)  ## posterior
legend("topleft",c("prior","likelihood","posterior"),
       col=c(1,2,4),lty=1)
@
Experiment with:
\begin{itemize}
\item weak priors ($a=1$, $b=1$)
\item strong priors that agree with the data (e.g. $a=5$, $b=10$ (mean prior
prob=1/3), $k=5$, $N=15$)
\item strong priors that disagree with the data (e.g. $a=10$, $b=5$ (mean
prior prob=2/3), $k=5$, $N=15$)
\item data much stronger than the prior (e.g. as in the above two examples,
but with $k=25$, $N=75$)
\end{itemize}

\bibliography{bookbib}
\end{document}