<>= source("chapskel.R") library(emdbook) @ %\todo{add other Bayesian refs: Ellison, books, etc.} \section*{Summary} This chapter continues to review the math you need to fit models to data, moving forward from functions and curves to probability distributions. The first part discusses ecological variability in general terms, then reviews basic probability theory and some important applications, including Bayes' Rule and its application in statistics. The second part reviews how to analyze and understand probability distributions. The third part provides a bestiary of probability distributions, finishing with a short digression on some ways to extend these basic distributions. \section{Introduction: why does variability matter?} For many ecologists and statisticians, noise is just a nuisance --- it gets in the way of drawing conclusions from the data. The traditional statistical approach to noise in data was to assume that all variation in the data was normally distributed, or transform the data until it was, and then use classical methods based on the normal distribution to draw conclusions. Some scientists turned to nonparametric statistics, %which rely only on relative ranks of the data in different categories which assume only that the shape of the data distribution is the same in all categories and provide tests of differences in the means or ``location parameters'' among categories. %models may be less powerful, %and they Unfortunately, classical nonparametric approaches make it much harder to draw quantitative conclusions from data (rather than simply rejecting or failing to reject null hypotheses about differences between groups). In the 1980s, as they acquired better computing tools, ecologists began to use more sophisticated models of variability such as generalized linear models (see Chapter~\ref{chap:std}). Chapter~\ref{chap:determ} illustrated a wide range of deterministic functions that correspond to eterministic models of the underlying ecological processes. This chapter will illustrate a wide range of models for the stochastic part of the dynamics. In these models, variability isn't just a nuisance, but actually tells us something about ecological processes. For example, census counts that follow a negative binomial distribution (p.~\pageref{sec:negbinom}) tell us there is some form of environmental variation or aggregative response among individuals that we haven't taken into account \citep{ShawDobson1995}. Remember from Chapter~\ref{chap:intro} that what we treat as ``signal'' (deterministic) and what we treat as ``noise'' (stochastic) depends on the question. The same ecological variability, such as spatial variation in light, might be treated as random variation by a forester interested in the net biomass increment of a forest stand and as a deterministic driving factor by an ecophysiologist interested in the photosynthetic response of individual plants. Noise affects ecological data in two different ways --- as \emph{measurement error} and as \emph{process noise} (this will become important in Chapter~\ref{chap:dyn} when we deal with dynamical models). Measurement error is the variability or ``noise'' in our measurements, which makes it hard to estimate parameters and make inferences about ecological systems. Measurement error leads to large confidence intervals and low statistical power. Even if we can eliminate measurement error, process noise or process error (often so-called even though it isn't technically an ``error'', but a real part of the system) still exists. 
Variability affects any ecological system. For example, we can observe thousands of individuals to determine the average mortality rate with great accuracy. The fate of a group of a few individuals, however, depends both on the variability in mortality rates of individuals and on the \emph{demographic stochasticity} that determines whether a particular individual lives or dies (``loses the coin toss''). Even if we know the average mortality rate perfectly, our predictions are still uncertain. \emph{Environmental stochasticity} --- spatial and temporal variability in (e.g.) mortality rate caused by variation in the environment rather than by the inherent randomness of individual fates --- also affects the dynamics.
%% cite Fox?
%(The dichotomy between measurement error and process error is not
%exactly the same as that between problems of inference and prediction,
%but it is related.)
Finally, even if we can minimize measurement error by careful measurement and minimize process noise by studying a large population in a constant environment (i.e. low levels of demographic and environmental stochasticity), ecological systems can still amplify variability in surprising ways \citep{BjornstadGrenfell2001}. For example, a tiny bit of demographic stochasticity at the beginning of an epidemic can trigger huge variation in epidemic dynamics \citep{RandWilson1991}. Variability also feeds back to change the mean behavior of ecological systems. For example, in the damselfish system described in Chapter~\ref{chap:explore} the number of recruits in any given cohort is the number of settlers surviving density-dependent mortality, but the average number of recruits is \emph{lower} than expected from an average-sized cohort of settlers because large cohorts suffer disproportionately high mortality and contribute relatively little to the average. This widespread phenomenon follows from \emph{Jensen's inequality} \citep{RuelAyres1999,Inouye2005}.

\section{Basic probability theory}

In order to understand stochastic terms in ecological models, you'll have to (re)learn some basic probability theory. To define a probability, we first have to identify the \emph{sample space}, the set of all the possible outcomes that could occur. Then the probability of an event $A$ is the frequency with which that event occurs. A few probability rules are all you need to know:
%\newcommand{\prob}{\mbox{Prob}} % defined in book preamble
\begin{enumerate}
\item{If two events are \emph{mutually exclusive} (e.g., ``individual is male'' and ``individual is female'') then the probability that either occurs (the probability of $A$ \emph{or} $B$, or $\prob(A \cup B)$) is the sum of their individual probabilities: e.g. $\prob(\mbox{male or female}) = \prob(\mbox{male})+\prob(\mbox{female})$.

We use this rule, for example, in finding the probability that an outcome is within a certain numeric range by adding up the probabilities of all the different (mutually exclusive) values in the range: for a discrete variable, for example, $P(3 \le X \le 5)= P(X=3)+P(X=4)+P(X=5)$.
}
\item{If two events $A$ and $B$ are not mutually exclusive --- the \emph{joint probability} that they occur together, $\prob(A \cap B)$, is greater than zero ---
%
then we have to correct the rule for combining probabilities to account for double-counting:
$$
\prob(A \cup B) = \prob(A) + \prob(B) - \prob(A \cap B).
$$
For example, if we are tabulating the color and sex of animals, $\prob(\mbox{blue or male}) = \prob(\mbox{blue}) + \prob(\mbox{male}) - \prob(\mbox{blue male})$.
}
\item{The probabilities of all possible outcomes of an observation or experiment add to 1.0 ($\prob(\mbox{male})+\prob(\mbox{female})=1.0$). We will need this rule to understand the form of probability distributions, which often contain a \emph{normalization constant} to make sure that the sum of the probabilities of all possible outcomes is 1.
}
\item{
The \emph{conditional probability} of $A$ \emph{given} $B$, $\prob(A|B)$, is the probability that $A$ happens if we know or assume that $B$ happens. The conditional probability equals
\begin{equation}
\prob(A|B) = \prob(A \cap B)/\prob(B).
\label{eq:cprob}
\end{equation}
For example:
\begin{equation}
\prob(\mbox{individual is blue|individual is male})=
\frac{\prob(\mbox{individual is a blue male})}{\prob(\mbox{individual is male})}.
\end{equation}
By contrast, we may also refer to the probability of $A$ when we make no assumptions about $B$ as the \emph{unconditional} probability of $A$: $\prob(A)=\prob(A|B)\prob(B) + \prob(A|\mbox{not } B)\prob(\mbox{not } B)$. Conditional probability is central to understanding \emph{Bayes' Rule} (p.~\pageref{bayesrule}).
}
\item{If the conditional probability of $A$ given $B$, $\prob(A|B)$, equals the unconditional probability of $A$, then $A$ is \emph{independent} of $B$. Knowing about $B$ provides no information about the probability of $A$. Independence implies that
\begin{equation}
\prob(A \cap B) = \prob(A) \prob(B),
\end{equation}
which follows from multiplying both sides of (\ref{eq:cprob}) by $\prob(B)$. The probabilities of combinations of independent events are multiplicative. Multiplying probabilities of independent events, or adding independent log-probabilities ($\log(\prob(A \cap B)) = \log(\prob(A))+ \log(\prob(B))$ if $A$ and $B$ are independent), is how we find the combined probability of a series of observations.
}
\end{enumerate}
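These rules are easy to check numerically in \R. As a minimal sketch of rule~1's discrete-variable example (using a hypothetical Binomial($N=10$, $p=0.5$) random variable, since the rule itself doesn't specify a distribution):
<<>>=
## P(3 <= X <= 5) as a sum over mutually exclusive outcomes
sum(dbinom(3:5, size=10, prob=0.5))
## the same probability computed from the cumulative distribution function
pbinom(5, size=10, prob=0.5) - pbinom(2, size=10, prob=0.5)
@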
We can immediately use these rules to think about the distribution of seeds taken in the seed removal experiment (Chapter~\ref{chap:explore}). The most obvious pattern in the data is that there are many zeros, probably corresponding to times when no predators visited the station. The sample space for seed disappearance is the number of seeds taken, from 0 to $N$ (the number available). Suppose that a predator visits the station with probability $v$, and that when it \emph{did} visit, it had an equal probability of taking any of the possible numbers of seeds (a uniform distribution from 0 to $N$). Since the probabilities must add to 1, this probability ($\prob(x\mbox{ taken}|\mbox{predator visits})$) is $1/(N+1)$ (0 to $N$ represents $N+1$ different possible events). What is the unconditional probability of $x$ seeds being taken? If $x>0$, then there is only one possible type of event --- the predator visited and took $x$ seeds --- with overall probability $v/(N+1)$ (Figure~\ref{fig:zi}, left). If $x=0$, then there are two mutually exclusive possibilities. Either the predator didn't visit (probability $1-v$), or it visited (probability $v$) and took zero seeds (probability $1/(N+1)$), so the overall probability is
\begin{equation}
\underbrace{(1-v)\vphantom{\frac{1}{N}}}_{\text{didn't visit}} +
\left(\underbrace{v\vphantom{\frac{1}{N}}}_{\text{visited}} \times
\underbrace{\frac{1}{N+1}}_{\text{took zero seeds}}\right)
= 1-v+\frac{v}{N+1}.
\end{equation}

Now make things a little more complicated and suppose that when a predator visits, it decides independently whether or not to take each seed. If the seeds of a given species are all identical, so that each seed is taken with the same probability $p$, then this process results in a binomial distribution. Using the rules above, the probability that a particular set of $x$ seeds is taken is $p^x$. It's also true that the other $N-x$ seeds are \emph{not} taken, with probability $(1-p)^{N-x}$. Thus the probability is proportional to $p^x \cdot (1-p)^{N-x}$. To get the probabilities of all possible outcomes to add to 1, though, we have to multiply by a normalization constant $N!/(x!(N-x)!)$\footnote{$N!$ means $N \cdot (N-1) \cdot \ldots \cdot 2 \cdot 1$, and is referred to as ``$N$ factorial''.}, or $\binom{N}{x}$. (It's too bad we can't just ignore these ugly normalization factors, which are always the least intuitive parts of probability formulas, but we really need them in order to get the right answers. Unless you are doing advanced calculations, however, you can usually just take the formulas for the normalization constants for granted, without trying to puzzle out their meaning.)

Now adding the ``predator may or may not visit'' layer to this formula, we have a probability
\begin{equation}
\underbrace{(1-v)\vphantom{B()}}_{\text{didn't visit}}+
\left(\underbrace{v\vphantom{B()}}_{\text{visited}}\cdot
\underbrace{\mbox{Binom}(0,p,N)}_{\text{took zero seeds}}\right)
= (1-v)+v (1-p)^N
\end{equation}
if $x=0$ ($\binom{N}{0}=1$, so the normalization constant disappears from the second term), or
\begin{equation}
\underbrace{v\vphantom{B()}}_{\text{visited}} \cdot
\underbrace{\mbox{Binom}(x,p,N)}_{\text{took $>0$ seeds}}
= v \binom{N}{x} p^x (1-p)^{N-x}
\end{equation}
if $x>0$ (Figure~\ref{fig:zi}, right).
\begin{figure}
<<>>=
op <- par(cex=1.5,las=1,bty="l",lwd=2,mfrow=c(1,2))
ziunif = function(x,p,v,N) {
  ifelse(x==0,(1-v)+v/(N+1),v/(N+1))
}
zibinom = function(x,p,v,N) {
  ifelse(x==0,(1-v)+v*dbinom(0,prob=p,size=N),
         v*dbinom(x,prob=p,size=N))
}
N=5
p=0.4
v=0.7
barplot(ziunif(0:N,p,v,N),main="Zero-inflated uniform",
        xlab="Taken",ylab="Probability",names=0:5)
barplot(zibinom(0:N,p,v,N),main="Zero-inflated binomial",
        xlab="Taken",ylab="Probability",names=0:5)
par(op)
@
\caption{Zero-inflated distributions. Left, zero-inflated uniform; right, zero-inflated binomial. Number of seeds $N=\Sexpr{N}$, probability of predator visit $v=\Sexpr{v}$, binomial probability of individual seed predation $p=\Sexpr{p}$.}
\label{fig:zi}
\end{figure}

This distribution is called the \emph{zero-inflated binomial} \citep{Inouye1999,Tyre+2003}. With only a few simple probability rules, we have derived a potentially useful distribution that might describe the pattern of seed predation better than any of the standard distributions we'll see later in this chapter.
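Reusing the \code{zibinom} function defined in the figure code above, a quick numerical sketch (with the same parameter values as Figure~\ref{fig:zi}) confirms that the derived distribution behaves like a proper probability distribution:
<<>>=
## probabilities of taking 0..5 seeds: note the inflated zero class
zibinom(0:5, p=0.4, v=0.7, N=5)
## rule 3: the probabilities of all possible outcomes add to 1
sum(zibinom(0:5, p=0.4, v=0.7, N=5))
@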
\section{Bayes' Rule}
%% fix transition???

With the simple probability rules defined above we can also derive, and understand, \emph{Bayes' Rule}. Most of the time we will use Bayes' Rule to go from the likelihood $\prob(D|H)$, the probability of observing a particular set of data $D$ given that a hypothesis $H$ is true (p.~\pageref{likedef}), to the information we really want, $\prob(H|D)$ ---
%
the probability of our hypothesis $H$ in light of our data $D$.
%(For example, in the
Bayes' Rule is just a recipe for turning around a conditional probability:
\begin{equation}
P(H | D) = \frac{P(D|H) P(H)}{P(D)}.
\end{equation}
Bayes' Rule is general --- $H$ and $D$ can be any events, not just hypothesis and data --- but it's easier to understand Bayes' Rule when we have something concrete to tie it to.
\label{bayesrule}

Deriving Bayes' Rule is almost as easy as remembering it. Rule \#4 on p.~\pageref{eq:cprob} applied to $P(H|D)$ implies
\begin{equation}
P(D \cap H) = P(H|D) P(D),
\end{equation}
while applying it to $P(D|H)$ tells us
\begin{equation}
P(H \cap D) = P(D|H) P(H).
\end{equation}
But $P(H \cap D) = P(D \cap H)$, so
\begin{equation}
P(H | D) P(D) = P(D | H) P(H)
\end{equation}
and therefore
\begin{equation}
P(H | D) = \frac{P(D | H) P(H)}{P(D)}.
\label{eq:bayes1}
\end{equation}
Equation (\ref{eq:bayes1}) says that the probability of the hypothesis given (in light of) the data is equal to the probability of the data given the hypothesis (the \emph{likelihood} associated with $H$), times the probability of the hypothesis, divided by the probability of the data. There are two problems here: we don't know the probability of the hypothesis, $P(H)$ (isn't that what we're trying to figure out in the first place?), and we don't know the unconditional probability of the data, $P(D)$.
\begin{figure}
\begin{center}
<<>>=
r <- 0
d <- acos(r)
scale <- c(0.5,0.3)
npoints <- 100
centre <- c(0.5,0.5)
a <- seq(0, 2 * pi, len = npoints)
m <- matrix(c(scale[1] * cos(a + d/2) + centre[1],
              scale[2] * cos(a - d/2) + centre[2]), npoints, 2)
e <- 0.05
hyp_pts = matrix(c(0.37,1.04,
                   1+e,0.8+e,
                   1,-e,
                   0.4,-e,
                   -e,0.25),
                 byrow=TRUE,ncol=2)
lab.pts = matrix(c(0.091,0.255,0.597,0.557,
                   0.869,0.709,0.549,0.511,
                   0.170,0.22, ##y
                   0.865,0.613,
                   0.932,0.698,0.191,0.477,
                   0.087,0.277,0.077,0.31),
                 ncol=2)
##hyp_pts <- hyp_pts[c(5,1:4),]
## lab.pts <- lab.pts[c(5,1:4),]
par(mar=c(0.2,0.2,0.2,0.2))
plot(1,1,type="n",xlim=c((-e),1+e),ylim=c(-e,1+e),ann=FALSE,
     xlab="",ylab="",axes=FALSE,xaxs="i",yaxs="i")
box()
polygon(m,col="lightgray",lwd=2)
polygon(c(-e,0.5,0.4,-e),c(0.25,0.5,-e,-e),density=8,angle=0,
        col="darkgray")
lines(m,lwd=2)
segments(rep(0.5,nrow(hyp_pts)),rep(0.5,nrow(hyp_pts)),
         hyp_pts[,1],hyp_pts[,2])
##text(lab.pts[,1],lab.pts[,2],1:10)
for(i in 1:5) {
  r = 2*i-1
  r2 = 2*i
  text(lab.pts[r,1],lab.pts[r,2],
       substitute(H[x], list(x=i)),adj=0,cex=2)
  text(lab.pts[r2,1],lab.pts[r2,2],
       substitute(D*intersect("","","")*H[x], list(x=i)),adj=0,cex=2)
}
@
\end{center}
\caption{Decomposition of the unconditional probability of the observed data ($D$) into the sum of the probabilities of the intersection of the data with each possible hypothesis ($\sum_{j=1}^N D \cap H_j$). The entire gray ellipse in the middle represents $D$. Each wedge (e.g. the hashed area $H_5$) represents an alternative hypothesis. The ellipse is divided into ``pizza slices'' (e.g. $D \cap H_5$, hashed and colored area). The area of each slice corresponds to $D \cap H_j$, the joint probability of the data $D$ (ellipse) \emph{and} the particular hypothesis $H_j$ (wedge).}
\label{fig:bayes1}
\end{figure}

Let's think about the second problem first --- our ignorance of $P(D)$. We can calculate an unconditional probability for the data if we have a set of \emph{exhaustive}, \emph{mutually exclusive} hypotheses: in other words, we assume that one, and only one, of our hypotheses is true. Figure~\ref{fig:bayes1} shows a geometric interpretation of Bayes' Rule. The gray ellipse represents $D$, the set of all possibilities that could lead to the observed data.
If one of the hypotheses must be true, then the unconditional probability of observing the data is the sum of the probabilities of observing the data under any of the possible hypotheses. For $N$ different hypotheses $H_1$ to $H_N$,
\begin{eqnarray}
P(D) & = & \sum_{j=1}^N P(D \cap H_j) \nonumber \\
& = & \sum_{j=1}^N P(H_j) P(D|H_j).
\label{eq:bayes2}
\end{eqnarray}
In words, the unconditional probability of the data is the sum of the likelihood of each hypothesis ($P(D|H_j)$) times its unconditional probability ($P(H_j)$). In Figure~\ref{fig:bayes1}, summing the area of overlap of each of the large wedges (the hypotheses $H_j$) with the gray ellipse ($H_j \cap D$) provides the area of the ellipse ($D$).

Substituting (\ref{eq:bayes2}) into (\ref{eq:bayes1}) gives the full form of Bayes' Rule for a particular hypothesis $H_i$ when it is one of a mutually exclusive set of hypotheses $\{ H_j \}$. The probability of the truth of $H_i$ in light of the data is
\begin{equation}
P(H_i | D) = \frac{P(D | H_i) P(H_i)}{\sum_j P(H_j) P(D|H_j)}.
\end{equation}
In Figure~\ref{fig:bayes1}, having observed the data $D$ means we know that reality lies somewhere in the gray ellipse. The probability that hypothesis~5 is true (i.e., that we are somewhere in the hashed area) is equal to the area of the hashed/colored ``pizza slice'' divided by the area of the ellipse. Bayes' Rule breaks this down further by supposing that we know how to calculate the likelihood of the data for each hypothesis --- the ratio of the area of the pizza slice ($D \cap H_5$) to the area of the entire hashed wedge ($H_5$). Then we can recover the area of each slice by multiplying the likelihood by the prior (the area of the wedge) and calculate both $P(D)$ and $P(H_5|D)$.

Dealing with the first problem, our ignorance of the unconditional or \emph{prior} probability of the hypothesis $P(H_i)$, is more difficult. In the next section we will simply assume that we have other information about this probability, and we'll revisit the problem shortly in the context of Bayesian statistics. But first, just to practice with Bayes' Rule, we'll explore two simpler examples that use Bayes' Rule to manipulate conditional probabilities.

\subsection{False positives in medical testing}

<<>>=
probI=1e-6
probfp=1e-2
@

Suppose the unconditional probability of a random person sampled from the population being infected ($I$) with some deadly but rare disease is one in a million: $P(I)=10^{-6}$. There is a test for this disease that never gives a false negative result: if you have the disease, you will definitely test positive ($P(+|I) = 1$). However, the test does occasionally give a false positive result. One person in 100 who doesn't have the disease (is uninfected, $U$) will test positive anyway ($P(+|U) = 10^{-2}$). This sounds like a pretty good test. Let's compute the probability that someone who tests positive is actually infected.

Replace $H$ in Bayes' rule with ``is infected'' ($I$) and $D$ with ``tests positive'' ($+$). Then
\begin{equation}
P(I|+) = \frac{P(+|I) P(I)}{P(+)}.
\end{equation}
We know $P(+|I)=1$ and $P(I)= 10^{-6}$, but we don't know $P(+)$, the unconditional probability of testing positive. Since you are either infected ($I$) or uninfected ($U$) --- these events are mutually exclusive and exhaustive ---
\begin{equation}
P(+) = P(+ \cap I) + P(+ \cap U).
\end{equation}
Then
\begin{equation}
P(+) = P(+|I) P(I) + P(+|U) P(U)
\end{equation}
because $P(I \cap +) = P(+|I) P(I)$ (eq. \ref{eq:cprob}).
We also know that $P(U)=1-P(I)$, so
\begin{equation}
\begin{split}
P(+) & = P(+|I) P(I) + P(+|U) (1-P(I)) \\
& = 1 \times 10^{-6} + 10^{-2} \times (1-10^{-6}) \\
& = 10^{-6} + 10^{-2} - 10^{-8} \\
& \approx 10^{-2}.
\end{split}
\end{equation}
Since $10^{-6}$ is ten thousand times smaller than $10^{-2}$, and $10^{-8}$ is even tinier, we can neglect them for now. Now that we've done the hard work of computing the denominator, we can put it together with the numerator:
\begin{equation}
\begin{split}
P(I|+) & = \frac{P(+|I) P(I)}{P(+)} \\
& \approx \frac{1 \times 10^{-6}}{10^{-2}} \\
& = 10^{-4}.
\end{split}
\end{equation}
Even though false positives are unlikely, the chance that you are infected if you test positive is still only 1 in 10,000! For a sensitive test (one that produces few false negatives) for a rare disease, the probability that a positive test is detecting a true infection is approximately $P(I)/P(\mbox{false positive})$, which can be surprisingly small. This false-positive issue also comes up in forensics cases (DNA testing, etc.). Treating a positive test as near-certain evidence, while ignoring the base rate (the unconditional probability of infection, or of guilt), is called the \emph{base rate fallacy}. It's important to think carefully about the sample population and the true probability of being guilty (or at least having been present at the crime scene) conditional on having your DNA match DNA found at the crime scene.

\subsection{Bayes' Rule and liana infestation}

A student of mine used Bayes' Rule as part of a simulation model of liana (vine) dynamics in a tropical forest. He wanted to know the probability that a newly emerging sapling would be in a given ``liana class'' ($L_1$=liana-free, $L_2$--$L_3$=light to moderate infestation, $L_4$=heavily infested with lianas). This probability depends on the number of trees nearby that are already infested ($N$). We have measurements of infestation of saplings from the field, and for each one we know the number of nearby infestations. Thus if we calculate the fraction of individuals in liana class $L_i$ with $N$ nearby infested trees, we get an estimate of $\prob(N|L_i)$. We also know the overall fractions in each liana class, $\prob(L_i)$. When we add a new tree to the model, we know the neighborhood infestation $N$ from the model. Thus we can figure out what we want to know, $\prob(L_i|N)$, by using Bayes' Rule to calculate
\begin{equation}
\prob(L_i|N) = \frac{\prob(N|L_i) \prob(L_i)}%
{\sum_{j=1}^4 \prob(N|L_j) \prob(L_j)}.
\end{equation}
For example, suppose we find that a new tree in the model has 3~infested neighbors. Let's say that the probabilities of each liana class (1 to 4) having 3 infested neighbors are $\prob(N|L_i) = \{ 0.05, 0.1, 0.3, 0.6 \}$ and that the unconditional probabilities of being in each liana class are $\prob(L_i) = \{0.5, 0.25, 0.2, 0.05 \}$. Then the probability that the new tree is heavily infested (i.e. is in class $L_4$) is
\begin{equation}
\frac{0.6 \times 0.05}%
{(0.05 \times 0.5) + (0.1 \times 0.25) + (0.3 \times 0.2) + (0.6 \times 0.05)} = \Sexpr{round((0.6*0.05)/(0.05*0.5+0.1*0.25+0.3*0.2+0.6*0.05),2)}.
\end{equation}
We would expect that a new tree with several infested neighbors has a much higher probability of heavy infestation than the overall (unconditional) probability of 0.05. Bayes' Rule allows us to quantify this guess.
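As a sketch in \R\ (using the hypothetical probabilities given above), the same calculation can be done for all four liana classes at once:
<<>>=
## Bayes' Rule for the liana example: posterior probability of each
## class, given N=3 infested neighbors
lik   <- c(0.05, 0.1, 0.3, 0.6)   ## prob(N=3 | L_i)
prior <- c(0.5, 0.25, 0.2, 0.05)  ## prob(L_i)
post  <- lik*prior/sum(lik*prior)
round(post, 2)                    ## last element is prob(L_4 | N=3)
@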
\subsection{Bayes' Rule in Bayesian statistics}

So what does Bayes' Rule have to do with Bayesian statistics?

Bayesians translate likelihood into information about parameter values using Bayes' Rule as given above. The problem is that we have the likelihood $\lik(\mbox{data}|\mbox{hypothesis})$, the probability of observing the data given the model (parameters); what we want is $\prob(\mbox{hypothesis}|\mbox{data})$. After all, we already know what the data are!

\subsubsection{Priors}

In the disease-testing and liana examples, we knew the overall, unconditional probability of disease or liana class in the population. When we're doing Bayesian statistics, however, we interpret $P(H_i)$ instead as the \emph{prior probability} of a hypothesis, our belief about the probability of a particular hypothesis \emph{before} we see the data. Bayes' Rule is the formula for updating the prior in order to compute the \emph{posterior probability} of each hypothesis, our belief about the probability of the hypothesis \emph{after} we see the data.

Suppose I have two hypotheses $A$ and $B$ and have observed some data $D$ with likelihoods $\lik_A=0.1$ and $\lik_B=0.2$. In other words, the probability of $D$ occurring if hypothesis $A$ is true ($P(D|A)$) is 10\%, while the probability of $D$ occurring if hypothesis $B$ is true ($P(D|B)$) is 20\%. If I assign the two hypotheses equal prior probabilities ($0.5$ each), then Bayes' Rule says the posterior probability of $A$ is
\begin{equation}
P(A|D) = \frac{0.1 \times 0.5}{0.1 \times 0.5 + 0.2 \times 0.5}= \frac{0.1}{0.3} = \frac{1}{3}
\end{equation}
and the posterior probability of $B$ is 2/3. However, if I had prior information that said $A$ was twice as probable ($\prob(A)=2/3$, $\prob(B)=1/3$) then the probability of $A$ given the data would be 0.5 (do the calculation). It is in principle possible to get whatever answer you want by rigging the prior: if you assign $B$ a prior probability of 0, then no data will \emph{ever} convince you that $B$ is true (in which case you probably shouldn't have done the experiment in the first place). Frequentists claim that this possibility makes Bayesian statistics open to cheating \citep{Dennis1996}; however, every Bayesian analysis must clearly state the prior probabilities it uses. If you have good reason to believe that the prior probabilities are not equal, from previous studies of the same or similar systems, then arguably you should \emph{use} that information rather than starting from scratch every time, as frequentists do. (The frequentist--Bayesian debate is one of the oldest and most virulent controversies in statistics \citep{Ellison1996,Dennis1996}; I can't possibly do it justice here.) However, it is a good idea to try so-called \emph{flat} or \emph{weak} or \emph{uninformative} priors --- priors that assume you have little information about which hypothesis is true --- as a part of your analysis, even if you do have prior information \citep{Edwards1996}.

You may have noticed in the first example above that when we set the prior probabilities equal, the posterior probabilities were just equal to the likelihoods divided by the sum of the likelihoods. Algebraically, if all the $P(H_i)$ are equal to the same constant $C$,
\begin{equation}
P(H_i | D) = \frac{P(D | H_i) C}{\sum_j P(D|H_j) C} = \frac{\lik_i}{\sum_j \lik_j}
\end{equation}
where $\lik_i$ is the likelihood of hypothesis $i$.
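This shortcut is one line of \R; a sketch for the two-hypothesis example above:
<<>>=
## with equal priors, posterior probabilities are scaled likelihoods
lik <- c(A=0.1, B=0.2)
lik/sum(lik)
@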
You may think that setting all the priors equal would be an easy way to eliminate the subjective nature of Bayesian statistics and make everybody happy. Two examples, however, will demonstrate that it's not that easy to say what it means to be completely ``objective'' or ignorant of the right hypothesis.
\begin{itemize}
\item{\emph{partitioning hypotheses}: suppose we find a nest missing eggs that might have been taken by a raccoon, a squirrel, or a snake (only). The three hypotheses ``raccoon'' ($R$), ``squirrel'' ($Q$), and ``snake'' ($S$) are our mutually exclusive and exhaustive set of hypotheses for the identity of the predator. If we have no other information (for example about the local densities or activity levels of different predators), we might choose equal prior probabilities for all three hypotheses. Since there are three mutually exclusive predators, $\prob(R)=\prob(Q)=\prob(S)=1/3$. Now a friend comes and asks us whether we really believe that mammalian predators are twice as likely to eat the eggs as reptiles ($\prob(R)+\prob(Q)=2\prob(S)$) (Figure~\ref{fig:discprior}). What do we do? We might solve this particular problem by setting the probability for snakes (the only reptiles) to 0.5, the probability for mammals ($\prob(R \cup Q)$) to 0.5, and the probabilities for raccoons and squirrels equal ($\prob(R)=\prob(Q)=0.25$), but this simple example suggests that such pitfalls are ubiquitous.
\begin{figure}
\begin{center}
<<>>=
op <- par(cex=1.5,las=1,bty="l",lwd=2,yaxs="i")
gcols <- gray(c(0.2,0.8))
b1 <- barplot(t(matrix(c(1/3,1/3,1/3,1/4,1/4,1/2),ncol=2)),beside=TRUE,
              xlab="Predator",ylab="Probability",space=c(0.2,2),
              col=gcols,yaxs="i")
axis(side=1,at=colMeans(b1),c("raccoon","squirrel","snake"))
segments(b1[1,1],0.4,b1[2,2],0.4)
text((b1[1,1]+b1[2,2])/2,0.45,"mammalian")
par(xpd=NA)
legend(2,0.57,c("by species","by group"),
       ncol=2,fill=gcols,bty="n")
par(op)
@
\end{center}
\caption{The difficulty of defining an uninformative prior for discrete hypotheses. Dark gray bars are priors that assume predation by each species is equally likely; light gray bars divide predation by group first, then by species within group.}
\label{fig:discprior}
\end{figure}
}
\item{\emph{changing scales}: a prior that is uniform on one scale of measurement is generally not uniform on another. Suppose we want an uninformative prior for a continuous parameter --- say the mass of an organism --- that we know lies between 10 and 100. Assigning equal probability density to every mass between 10 and 100 seems uninformative, but so does assigning equal probability density to every value of the \emph{logarithm} of mass between $\log(10)$ and $\log(100)$, and the two choices disagree: a prior that is flat on the linear scale puts more of its probability on large masses than a prior that is flat on the log scale (Figure~\ref{fig:contprior}).
\begin{figure}
\begin{center}
<<>>=
minx <- 10
maxx <- 100
dx <- maxx-minx
dlx <- log(maxx/minx)
dlx10 <- log10(maxx/minx)
xlim <- c(0,110)
Lxlim <- c(9,110)
op <- par(cex=2,las=1,bty="l",lwd=2,mfrow=c(1,2),
          mar=c(5,4,3,0.5)+0.1,yaxs="i")
## left panel: the two priors as densities with respect to mass
curve(ifelse(x>minx & x<maxx, 1/dx, 0), from=xlim[1], to=xlim[2], n=501,
      xlab="Mass", ylab="Probability density")
curve(ifelse(x>minx & x<maxx, 1/(x*dlx), 0), add=TRUE, n=501, lty=2)
## right panel: the same priors as densities with respect to log10(mass)
curve(ifelse(x>log10(minx) & x<log10(maxx), 10^x*log(10)/dx, 0),
      from=log10(Lxlim[1]), to=log10(Lxlim[2]), n=501,
      xlab="Log mass", ylab="Probability density")
curve(ifelse(x>log10(minx) & x<log10(maxx), 1/dlx10, 0), add=TRUE, n=501, lty=2)
par(op)
@
\end{center}
\caption{The difficulty of defining an uninformative prior on continuous scales. A prior that is uniform on the linear scale (solid) is not uniform on the logarithmic scale, and a prior that is uniform on the logarithmic scale (dashed) is not uniform on the linear scale.}
\label{fig:contprior}
\end{figure}
}
\end{itemize}

The denominator of Bayes' Rule also deserves attention. If the set of hypotheses is continuous --- for example, all the possible values of a binomial per-trial probability $p$ between 0 and 1 --- then the sum in (\ref{eq:bayes2}) becomes an integral over the prior distribution. For example, the unconditional probability of observing 2 successes out of 3 binomial trials, assuming a flat prior on $p$, is $\int_0^1 \binom{3}{2} p^2 (1-p) \, dp$, which we can evaluate numerically in \R:
<<>>=
integrate(function(p)dbinom(2,prob=p,size=3),lower=0,upper=1)
@
For the binomial case and other simple probability distributions, it's easy to sum or integrate the denominator either analytically or numerically. If we only care about the \emph{relative} probability of different hypotheses, we don't need to integrate the denominator because it has the same constant value for every hypothesis. Often, however, we do want to know the absolute probability. Calculating the unconditional probability of the data (the denominator for Bayes' Rule) can be extremely difficult for more complicated problems. Much of current research in Bayesian statistics focuses on ways to calculate the denominator. We will revisit this problem in Chapters~\ref{chap:lik} and~\ref{chap:opt}, first integrating the denominator by brute-force numerical integration, then looking briefly at a sophisticated technique for Bayesian analysis called Markov chain Monte Carlo.

\subsection{Conjugate priors}

Using so-called \emph{conjugate priors} makes it easy to do the math for Bayesian analysis. Imagine that we're flipping coins (or measuring tadpole survival or counting numbers of different morphs in a fixed sample) and that we use the binomial distribution to model the data.
For a binomial with a per-trial probability of $p$ and $N$ trials, the probability of $x$ successes is proportional (leaving out the normalization constant) to $p^x (1-p)^{N-x}$. Suppose that instead of describing the probability of $x$ successes with a fixed per-trial probability $p$ and number of trials $N$, we wanted to describe the probability of a given per-trial probability $p$ with fixed $x$ and $N$. We would get $\prob(p)$ proportional to $p^x (1-p)^{N-x}$ --- \emph{exactly the same formula}, but with a different proportionality constant and a different interpretation. Instead of a discrete probability distribution over a sample space of all possible numbers of successes ($0$ to $N$), now we have a continuous probability distribution over all possible probabilities (all values between 0 and 1). The second distribution, for $\prob(p)$, is called the Beta distribution (p.~\pageref{sec:beta}), and it is the conjugate prior for the binomial distribution.

Mathematically, conjugate priors have the same structure as the probability distribution of the data. They lead to a posterior distribution with the same mathematical form as the prior, although with different parameter values. Intuitively, you get a conjugate prior by turning the likelihood around to ask about the probability of a parameter instead of the probability of the data. We'll come back to conjugate priors and how to use them in Chapters~\ref{chap:lik} and \ref{chap:opt}.
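As a concrete sketch (using the standard conjugacy result that a $\mbox{Beta}(a,b)$ prior combined with $x$ successes in $N$ binomial trials gives a $\mbox{Beta}(a+x,b+N-x)$ posterior), a flat $\mbox{Beta}(1,1)$ prior and the 2-successes-out-of-3 data used above give:
<<>>=
## posterior distribution of p: Beta(1+2, 1+1); no numerical
## integration of the denominator is needed with a conjugate prior
curve(dbeta(x, shape1=1+2, shape2=1+1), from=0, to=1,
      xlab="Per-trial probability p", ylab="Posterior density")
@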
\section{Analyzing probability distributions}

You need the same kinds of skills and intuitions about the characteristics of probability distributions that we developed in Chapter~\ref{chap:determ} for mathematical functions.

\subsection{Definitions}

\paragraph{Discrete}
A probability distribution is the set of probabilities on a sample space or set of outcomes. Since this book is about modeling quantitative data, we will always be dealing with sample spaces that are numbers --- the number or amount observed in some measurement of an ecological system. The simplest distributions to understand are \emph{discrete} distributions whose outcomes are a set of integers: most of the discrete distributions we'll deal with describe counting or sampling processes and have ranges that include some or all of the non-negative integers.

A discrete distribution is most easily described by its distribution function, which is just a formula for the probability that the outcome of an experiment or observation (called a \emph{random variable}) $X$ is equal to a particular value $x$ ($f(x)=\prob(X=x)$). A distribution can also be described by its cumulative distribution function $F(x)$ (note the uppercase $F$), which is the probability that the random variable $X$ is less than or equal to a particular value $x$ ($F(x)=\prob(X\le x)$). Cumulative distribution functions are most useful for frequentist calculations of tail probabilities, e.g. the probability of getting $n$ or more heads in a series of coin-tossing experiments with a given trial probability.
% I'll use the binomial distribution,
% the probability of a given number of ``hits''
% or ``successes'' in a fixed number of
% independent, equiprobable trials, as
% an example in the . With $N$ trials with
% a probability of $p$ of success in each one,
% the binomial probability distribution
% for the number of successes $k$ is
% is $\binom{N}{k} p^k (1-p)^{N-k}$ (for $x$ between
% 0 and $N$).
% I'll also refer to the Poisson distribution,
% which describes the distribution
% of the number of ``events'' (animals counted,
% hurricanes, matings) when each event is independent
% and there is no upper limit on the number
% of events (p.~\pageref{sec:poisson}).

\paragraph{Continuous}
\label{sec:contdens}
A probability distribution over a continuous range (such as all real numbers, or the non-negative real numbers) is called a \emph{continuous} distribution.
%Continuous distributions have a similar definition
%to discrete distributions.
The cumulative distribution function of a continuous distribution ($F(x)=\prob(X \le x)$) is easy to define and understand
%
--- it's just the probability that the continuous random variable $X$ is smaller than a particular value $x$ in any given observation or experiment --- but the probability \emph{density function} (the analogue of the distribution function for a discrete distribution) is more confusing, since the probability of any precise value is zero. You may imagine that a measurement of (say) pH is \emph{exactly} 7.9, but in fact what you have observed is that the pH is between 7.82 and 7.98 --- if your meter has a precision of $\pm$ 1\%. Thus continuous probability distributions are expressed as probability \emph{densities} rather than probabilities --- the probability that the random variable $X$ is between $x$ and $x+\Delta x$, divided by $\Delta x$ ($\prob(7.82 < X < 7.98)/0.16$, in this case). Dividing by $\Delta x$ allows the observed probability density to have a well-defined limit as precision increases and $\Delta x$ shrinks to zero. Unlike probabilities, probability densities can be larger than 1 (Figure~\ref{fig:probdists}). For example, if the pH probability distribution is uniform on the interval [7,7.1] but zero everywhere else, its probability density is 10. In practice, we will mostly be concerned with \emph{relative} probabilities or likelihoods, and so the maximum density values and whether they are greater than or less than 1 won't matter much.
\begin{figure}
<<>>=
op=par(mfrow=c(2,2),mgp=c(2.5,1,0),mar=c(3.2,4.3,1,1),bty="l",las=1,
       cex=1.5,yaxs="i")
plot(0:5,dbinom(0:5,size=5,prob=0.3),
     main="", xlab="",
     ylab="Probability",type="h",lwd=4,
     axes=FALSE,ylim=c(0,0.4))
axis(side=1)
axis(side=2,at=c(0,0.4))
box()
plot(0:5,pbinom(0:5,size=5,prob=0.3),
     main="",type="h",lwd=4,
     xlab="", ylab="Cumulative\nprobability",ylim=c(0,1),axes=FALSE)
box()
axis(side=1)
axis(side=2,at=c(0,1))
curve(dexp(x,1.5),from=0,to=6,ylab="Probability density",
      main="",xaxs="i",yaxs="i",axes=FALSE,ylim=c(0,1.5))
axis(side=1,at=c(0,6))
axis(side=2,at=c(0,1.5))
box()
curve(pexp(x,1.5),from=0,to=6,ylab="Cumulative\nprobability",
      main="",xaxs="i",yaxs="i",axes=FALSE,ylim=c(0,1))
axis(side=1,at=c(0,6))
axis(side=2,at=c(0,1))
box()
par(op)
@
\caption{Probability, probability density, and cumulative distributions. Top: discrete (binomial: $N=5$, $p=0.3$) probability and cumulative probability distributions. Bottom: continuous (exponential: $\lambda=1.5$) probability density and cumulative probability distributions.}
\label{fig:probdists}
\end{figure}

\subsection{Means (expectations)}

The first thing you usually want to know about a distribution is its average value, also called its mean or expectation. In general the expectation operation, denoted by $E[\cdot]$ (or a bar over a variable, such as $\bar x$), gives the ``expected value'' of a set of data, or a probability distribution, which in the simplest case is the same as its (arithmetic) mean value.
For a set of $N$ data values written down separately as $\{x_1,x_2,x_3, \ldots x_N\}$, the formula for the mean is familiar:
\begin{equation}
E[x] = \frac{\sum_{i=1}^N x_i}{N}.
\end{equation}
Suppose we have the data tabulated instead, so that for each possible value of $x$ (for a discrete distribution) we have a count of the number of observations (possibly zero, possibly more than 1), which we call $c(x)$. Summing over all of the possible values of $x$, we have
\begin{equation}
E[x] = \frac{\sum_{i=1}^N x_i}{N} = \frac{\sum c(x) x}{N} = \sum \left(\frac{c(x)}{N}\right) x = \sum \prob(x) x
\end{equation}
where $\prob(x)$ is the discrete probability distribution representing this particular data set. More generally, you can think of $\prob(x)$ as representing some particular theoretical probability distribution which only approximately matches any actual data set.

We can compute the mean of a continuous distribution as well. First, let's think about grouping (or ``binning'') the values in a discrete distribution into categories of size $\Delta x$. Then if $p(x)$, the probability density in bin $x$, is $\prob(x)/\Delta x$, the formula for the mean becomes $\sum p(x) \cdot x \, \Delta x$. If we have a continuous distribution with $\Delta x$ very small, this becomes $\int p(x) x \, dx$. (This is in fact the definition of an integral.) For example, an exponential distribution $p(x) = \lambda \exp(-\lambda x)$ has an expectation or mean value of $\int \lambda \exp(-\lambda x) x \, dx=1/\lambda$. (You don't need to know how to do this integral analytically, although the \R\ supplement will show a little bit about numerical integration in \R.)
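Here is a minimal numerical check of the exponential mean, using \R's built-in \code{integrate} and \code{dexp} (with an arbitrarily chosen rate $\lambda=1.5$):
<<>>=
## E[x] for an exponential distribution with rate 1.5: should equal 1/1.5
integrate(function(x) x*dexp(x,rate=1.5), lower=0, upper=Inf)
1/1.5
@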
\subsection{Variances (expectation of $X^2$)}

The mean is the expectation of the random variable $X$ itself, but we can also ask about the expectation of functions of $X$. The first example is the expectation of $X^2$. We just fill in the value $x^2$ for $x$ in all of the formulas above: $E[x^2]=\sum \prob(x) x^2$ for a discrete distribution, or $\int p(x) x^2 \, dx$ for a continuous distribution. (We are \emph{not} asking for $\sum \prob(x^2) x^2$.) The expectation of $x^2$ is a component of the variance, which is the expected value of $(x-E[x])^2$ or $(x-\bar x)^2$, or the expected squared deviation around the mean. (We can also show that
% \begin{eqnarray*}
% E[(x-E[x])^2] & = & E[x^2-2xE[x]+(E[x])^2] \\
% & = & E[x^2]-E[2xE[x]]+E[E[x]^2] \\
% & = & E[x^2]-2 E[x] \cdot E[x] +E[x]^2 \\
% & = & E[x^2]-2 (E[x])^2 +(E[x])^2 \\
% & = & E[x^2]-(E[x])^2
% \end{eqnarray*}
\begin{equation}
E[(x-\bar x)^2] = E[x^2] - (\bar x)^2
\end{equation}
by using the rules for expectations that (1) $E[x+y]=E[x]+E[y]$ and (2) if $c$ is a constant, $E[cx]=cE[x]$. The right-hand formula is simpler to compute than $E[(x-\bar x)^2]$, but more subject to roundoff error.)

Variances are easy to work with because they are additive (we will show later that $\mbox{Var}(a+b)=\mbox{Var}(a)+\mbox{Var}(b)$ if $a$ and $b$ are uncorrelated), but harder to compare with means since their units are the units of the mean squared. Thus we often use instead the \emph{standard deviation} of a distribution ($\sqrt{\mbox{Var}}$), which has the same units as $X$.

Two other summaries related to the variance are the \emph{variance-to-mean ratio} and the \emph{coefficient of variation} (CV), which is the ratio of the standard deviation to the mean. The variance-to-mean ratio has the same units as the mean; it is primarily used to characterize discrete sampling distributions and compare them to the Poisson distribution, which has a variance-to-mean ratio of 1. The CV is more common, and is useful when you want to describe variation that is proportional to the mean. For example, if you have a pH meter that is accurate to $\pm 10\%$, so that a true pH value of $x$ will give measured values that are normally distributed with $2 \sigma = 0.1 x$\footnote{Remember that the 95\% confidence limits of the normal distribution are approximately $\mu \pm 2 \sigma$.}, then $\sigma = 0.05 x$ and the CV is 0.05.

\subsection{Higher moments}

The expectation of $(x-E[x])^3$ tells you the \emph{skewness} of a distribution or a data set, which indicates whether it is asymmetric around its mean. The expectation $E[(x-E[x])^4]$ measures the \emph{kurtosis}, the ``pointiness'' or ``flatness'', of a distribution. These are called the third and fourth \emph{central moments} of the distribution. In general, the $n^{\mbox{th}}$ moment is $E[x^n]$, and the $n^{\mbox{th}}$ central moment is $E[(x-\bar x)^n]$; the mean is the first moment, and the variance is the second central moment. We won't be too concerned with these summaries (of data or distributions), but they do come up sometimes.
% ignore expectations over arbitrary distributions
% since they're only needed here for JE
%serve to introduce the more general idea of
%taking the expectation of a function over a distribution.
% Instead of taking the expectation of $x$ (the mean)
% or $(x-E[x])^2$ (the variance), suppose we take
% the expectation of some arbitrary function $g(x)$
% (for example the Michaelis-Menten function) over
% a probability distribution $P(x)$ or a density function $p(x)$:
% \begin{equation}
% E[g(x),P(x)] = \sum_{i=1}^N P(x) g(x)
% \end{equation}
% or
% \begin{equation}
% E[g(x),p(x)] = \int p(x) g(x) \, dx.
% \end{equation}

\subsection{Median and mode}

The median and mode are two final properties of probability distributions that are not related to moments. The \emph{median} of a distribution is the point which divides the area of the probability density in half, or the point at which the cumulative distribution function is equal to 0.5. It is often useful for describing data, since it is \emph{robust} --- outliers change its value less than they change the mean --- but for many distributions it's more complicated to compute than the mean. The \emph{mode} is the ``most likely value'', the maximum of the probability distribution or density function. For symmetric distributions the mean, mode, and median are all equal; for right-skewed distributions, in general mode $<$ median $<$ mean.

\subsection{The method of moments}

Suppose you know the theoretical values of the moments (e.g. mean and variance) of a distribution and have calculated the sample values of the moments (by calculating $\bar x = \sum x/N$ and $s^2= \sum (x-\bar x)^2/N$; don't worry for the moment about whether the denominator in the sample variance should be $N$ or $N-1$). Then there is a simple way to estimate the parameters of a distribution, called the \emph{method of moments}: just match the sample values up with the theoretical values. For the normal distribution, where the parameters of the distribution are just the mean and the variance, this is trivially simple: $\mu=\bar x$, $\sigma^2=s^2$. For a distribution like the negative binomial, however (p.~\pageref{sec:negbinom}), it involves a little bit of algebra.
The negative binomial has parameters $\mu$ (equal to the mean, so that's easy) and $k$; the theoretical variance is $\sigma^2=\mu(1+\mu/k)$. Therefore, setting $\mu=\bar x$, $s^2 \approx \mu(1+\mu/k)$, and solving for $k$, we calculate the method-of-moments estimate of $k$:
\begin{equation}
\begin{split}
\sigma^2 & = \mu(1+\mu/k) \\
s^2 & \approx \bar x (1+\bar x/k) \\
\frac{s^2}{\bar x} - 1 & \approx \frac{\bar x}{k} \\
k & \approx \frac{\bar x}{s^2/\bar x - 1}
\end{split}
\end{equation}
\label{negbinom-moments}
The method of moments is very simple but is biased in many cases; it's a good way to get a first estimate of the parameters of a distribution, but for serious work you should follow it up with a maximum likelihood estimator (Chapter~\ref{chap:lik}).
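As a quick sketch in \R\ (with made-up count data, purely to illustrate the algebra above):
<<>>=
## method-of-moments estimate of the negative binomial k
## for a small hypothetical sample of counts
x <- c(0,1,0,3,2,7,1,0,2,4)
m <- mean(x)
v <- sum((x-m)^2)/length(x)  ## sample variance with denominator N
c(mu=m, k=m/(v/m-1))
@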
\section{Bestiary of distributions}

The rest of the chapter presents brief introductions to a variety of useful probability distributions, including the mechanisms behind them and some of their basic properties. Like the bestiary in Chapter~\ref{chap:determ}, you can skim this bestiary on the first reading. The appendix of \citet{Gelman+1996} contains a useful table, more abbreviated than these descriptions but covering a wider range of functions. The book by \citet{Evans+2000} is also useful.
% NOT mentioning Johnson and Kotz, too much for this audience
% The book goes over mathematical properties, equations, etc. in some
% detail, but I want to hit a few highlights of the kinds of places
% these models might be appropriate. The \emph{phenomenological}
% (``gee, this fits the wiggles in my data'') vs. \emph{mechanistic}
% (``this distribution is generated by the same sorts of processes as I
% think might be happening in my system'') distinction holds, just as
% for the ecological functions discussed in lecture 4.

\subsection{Discrete models}

% You can see Phil Pollett's Java applets that illustrate these distributions,
% and dynamically show you the effects of changing the parameters,
% at
% \url{http://www.maths.uq.edu.au/~pkp/teaching/PlotProb/}.

\subsubsection{Binomial}

The binomial is probably the easiest distribution to understand. It applies when you have samples with a fixed number of subsamples or ``trials'' in each one, each trial can have one of two values (black/white, heads/tails, alive/dead, species A/species B), and the probability of ``success'' (black, heads, alive, species A) is the same in every trial. If you flip a coin 10 times ($N=10$) and the probability of a head in each coin flip is $p=0.7$, then the number of heads will have a binomial distribution with parameters $N=10$ and $p=0.7$.\footnote{\cite{GelmanNolan2002} point out that it is not physically possible to construct a coin that is biased when flipped --- although a spinning coin can be biased. \cite{Diaconis+2004} even tested a coin made of balsa wood on one side and lead on the other to establish that it was unbiased.} Don't confuse the trials (subsamples), and the probability of success in each trial, with the number of samples and the probabilities of the number of successful trials in each sample. In the seed predation example, a trial is an individual seed and the trial probability is the probability that an individual seed is taken, while a sample is the observation of a particular station at a particular time and the binomial probabilities are the probabilities that a certain total number of seeds disappears from the station.

You can derive the part of the distribution that depends on $x$, $p^x (1-p)^{N-x}$, by multiplying the probabilities of $x$ independent successes with probability $p$ and $N-x$ independent failures with probability $1-p$. The rest of the distribution function, $\binom{N}{x}=N!/(x!(N-x)!)$, is a \emph{normalization constant} that we can justify either with a combinatorial argument about the number of different ways of sampling $x$ objects out of a set of $N$ (Appendix), or simply by saying that we need a factor in front of the formula to make sure the probabilities add up to 1.

The variance of the binomial is $N p(1-p)$. Like most discrete sampling distributions (e.g. the binomial, Poisson, negative binomial), this variance depends on the number of trials per sample $N$. When the number of trials per sample increases, the variance also increases, but the coefficient of variation ($\sqrt{Np(1-p)}/(Np)=\sqrt{(1-p)/(Np)}$) decreases. The dependence on $p(1-p)$ means the binomial variance is small when $p$ is close to 0 or 1 (and therefore the values are scrunched up near 0 or $N$), and largest when $p=0.5$. The coefficient of variation, on the other hand, is largest for small $p$.

When $N$ is large and $p$ isn't too close to 0 or 1 (i.e. when $Np$ is large), then the binomial distribution is approximately normal (Figure~\ref{fig:probdisttab}). A binomial distribution with only one trial ($N=1$) is called a \emph{Bernoulli} trial. You should only use the binomial in fitting data when there is an upper limit to the number of possible successes. When $N$ is large and $p$ is small, so that the upper limit of $N$ successes is effectively never reached, the binomial approaches the Poisson distribution, which is covered in the next section (Figure~\ref{fig:probdisttab}).
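For example, we can check the formula against \R's built-in binomial functions for the coin-flipping case above (the probability of exactly 7 heads out of $N=10$ flips with $p=0.7$):
<<>>=
## explicit formula vs. built-in density function
choose(10,7)*0.7^7*(1-0.7)^(10-7)
dbinom(7, size=10, prob=0.7)
@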
<<>>=
multdiscplot = function(x,parvec,parname="prob",dfun="dbinom",col,
                        displ=0.2,xlab="x",ylab="Probability",lwd=7,axes=TRUE,...) {
  op = par(lend=2)
  tmpfun = function(p) {
    params = c(list(...),x=list(x),p=p)
    names(params)[length(params)]=parname
    do.call(dfun,params)
  }
  vals=sapply(parvec,tmpfun)
  n =length(parvec)
  xoff = seq(-displ*n/2,displ*n/2,length=n)
  plot(range(x)+range(xoff),range(vals),type="n",xlab=xlab,ylab=ylab,
       axes=axes)
  for (i in 1:length(parvec)) {
    points(x-xoff[i],vals[,i],col=col[i],type="h",lwd=lwd)
  }
  par(op)
}
multdiscplot2 = function(x,parmat,parnames=c("mu","size"),
                         dfun="dnbinom",col,displ=0.2,xlab="x",ylab="Probability",lwd=7,...) {
  op = par(lend=2)
  n =nrow(parmat)
  vals = matrix(nrow=length(x),ncol=nrow(parmat))
  for (i in 1:n) {
    params = c(list(...),x=list(x),p1=parmat[i,1],p2=parmat[i,2])
    lp = length(params)
    names(params)[(lp-1):lp] = parnames
    vals[,i] = do.call(dfun,params)
  }
  xoff = seq(-displ*n/2,displ*n/2,length=n)
  plot(range(x)+range(xoff),range(vals),type="n",xlab=xlab,ylab=ylab)
  for (i in 1:n) {
    points(x-xoff[i],vals[,i],col=col[i],type="h",lwd=lwd)
  }
  par(op)
}
@
\begin{figure}
<<>>=
op <- par(lwd=2,cex=1.5,las=1,bty="l",mgp=c(3,1,0),yaxs="i")
## palette(gray((0:8)/8))
multdiscplot(0:10,parvec=c(0.1,0.5,0.9),size=10,col=gray((0:3)/3),lwd=5,
             xlab="# of successes",displ=0.1)
par(xpd=NA)
text(c(1,5,9),
     c(0.42,0.27,0.42),
     paste("p=",c(0.1,0.5,0.9),sep=""),
     col=gray((0:3)/3))
#legend(2,0.5,paste("N=10, p=",c(0.1,0.5,0.9),sep=""),
#       lty=1,lwd=5,col=gray((0:3)/3))
par(xpd=TRUE)
par(op)
@
\caption{Binomial distribution. Number of trials ($N$) equals 10 for all distributions.}
\label{fig:binom}
\end{figure}

\emph{Examples:} number of surviving individuals/nests out of an initial sample; number of infested/infected animals, fruits, etc. in a sample; number of a particular class (haplotype, subspecies, etc.) in a larger population.

\textbf{Summary:} \\
\begin{tabular}{ll}
range & discrete, $0 \le x \le N$ \\
distribution & $\binom{N}{x} p^x (1-p)^{N-x}$ \\
\R & \code{dbinom}, \code{pbinom}, \code{qbinom}, \code{rbinom} \\
parameters & $p$ [real, 0--1], probability of success [\code{prob}] \\
& $N$ [positive integer], number of trials [\code{size}] \\
mean & $Np$ \\
variance & $Np(1-p)$ \\
CV & $\sqrt{(1-p)/(Np)}$ \\
Conjugate prior & Beta
\end{tabular}

\subsubsection{Poisson}
\label{sec:poisson}

The Poisson distribution gives the distribution of the number of individuals, arrivals, events, counts, etc., in a given time/space/unit of counting effort if each event is independent of all the others. The most common definition of the Poisson has only one parameter, the average density or arrival rate, $\lambda$, which equals the expected number of counts in a sampling unit. An alternative parameterization gives a density \emph{per unit sampling effort} and then specifies the mean as the product of the density per sampling effort $r$ times the sampling effort $t$, $\lambda=rt$. This parameterization emphasizes that even when the population density is constant, you can change the Poisson distribution of counts by sampling more extensively --- for longer times or over larger quadrats.

The Poisson distribution has no upper limit, although values much larger than the mean value are highly improbable. This characteristic provides a rule for choosing between the binomial and Poisson. If you expect to observe a ``ceiling'' on the number of counts, you should use the binomial; if you expect the number of counts to be effectively unlimited, even if it is theoretically bounded (e.g. there can't really be an infinite number of plants in your sampling quadrat), use the Poisson.

The variance of the Poisson is equal to its mean. However, the \emph{coefficient of variation} (CV=standard deviation/mean) decreases as the mean increases, so in that sense the Poisson distribution becomes more regular as the expected number of counts increases. The Poisson distribution \emph{only makes sense for count data}. Since the CV is unitless, it should not depend on the units we use to express the data; since the CV of the Poisson is $1/\sqrt{\mbox{mean}}$, that means that if we used a Poisson distribution to describe data on measured lengths, we could reduce the CV by a factor of 10 by changing from meters to centimeters (which would be silly). For $\lambda<1$ the Poisson's mode is at zero. When the expected number of counts gets large (e.g. $\lambda>10$) the Poisson becomes approximately normal (Figure~\ref{fig:probdisttab}).
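To illustrate the binomial-to-Poisson limit numerically (an arbitrary example with $N=1000$ and $p=0.003$, so $\lambda = Np = 3$):
<<>>=
## binomial with large N and small p is nearly Poisson with lambda=N*p
dbinom(0:4, size=1000, prob=0.003)
dpois(0:4, lambda=3)
@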
\begin{figure}
<<>>=
par(lwd=2,cex=1.5,las=1,bty="l",mgp=c(2.5,1,0),yaxs="i")
pvec = c(0.8,3,12)
multdiscplot(0:20,parvec=pvec,parname="lambda",
             dfun="dpois",col=gray((0:3)/3),lwd=3,
             xlab="# of events",displ=0.1)
## construct legend by hand
## legend(10,0.4,
##        c(expression(lambda==0.8),
##          expression(lambda==3),
##          expression(lambda==12)),
##        lwd=3,col=1:3,lty=1)
text(c(1.15,2.7,10.2),
     c(0.42,0.26,0.15),
     c(expression(lambda==0.8),
       expression(lambda==3),
       expression(lambda==12)),
     col=gray((0:3)/3),adj=0)
## leg.y = seq(0.38,length=3,by=-0.02)
## segments(rep(10,3),leg.y,rep(12,3),leg.y,
##          col=gray((0:3)/3),lwd=3)
## for (i in 1:3)
##   text(rep(12.4,3),leg.y[i],
##        substitute(lambda==z,list(z=pvec[i])),adj=0)
@
\caption{Poisson distribution.}
\label{fig:poisson}
\end{figure}

\emph{Examples:} number of seeds/seedlings falling in a gap; number of offspring produced in a season (although this might be better fit by a binomial if the number of breeding attempts is fixed); number of prey caught per unit time.

\textbf{Summary:} \\
\begin{tabular}{ll}
range & discrete ($0 \le x$)\\
distribution & $\frac{e^{-\lambda} \lambda^n}{n!}$ \\
& \textbf{or} $\frac{e^{-r t} (rt)^n}{n!}$ \\
\R & \code{dpois}, \code{ppois}, \code{qpois}, \code{rpois} \\
parameters & $\lambda$ (real, positive), expected number per sample [\code{lambda}] \\
& \textbf{or} $r$ (real, positive), expected number per unit effort, area, time, etc. (\emph{arrival rate}) \\
mean & $\lambda$ (\textbf{or} $rt$) \\
variance & $\lambda$ (\textbf{or} $rt$) \\
CV & $1/\sqrt{\lambda}$ (\textbf{or} $1/\sqrt{rt}$) \\
Conjugate prior & Gamma \\
\end{tabular}

\subsubsection{Negative binomial}
\label{sec:negbinom}

Most probability books derive the negative binomial distribution from a series of independent binary (heads/tails, black/white, male/female, yes/no) trials that all have the same probability of success, like the binomial distribution. Rather than counting the number of successes obtained in a fixed number of trials, which would result in a binomial distribution, the negative binomial counts the number of \emph{failures} before a predetermined number of successes occurs. This failure-process parameterization is only occasionally useful in ecological modeling.
% (although I have used a variant of the negative
% binomial to describe the distribution of the infectious period of a
% disease in days, assuming that individuals have to go through multiple
% stages before they recover or die)
Ecologists use the negative binomial because it is discrete, like the Poisson, but its variance can be larger than its mean (i.e. it can be \emph{overdispersed}). Thus, it's a good phenomenological description of a patchy or clustered distribution with no intrinsic upper limit that has more variance than the Poisson.

The ``ecological'' parameterization of the negative binomial replaces the parameters $p$ (probability of success per trial: \code{prob} in \R) and $n$ (number of successes before you stop counting failures: \code{size} in \R) with $\mu=n(1-p)/p$, the mean number of failures expected (or of counts in a sample: \code{mu} in \R), and $k$, which is typically called an \emph{overdispersion parameter}. Confusingly, $k$ is also called \code{size} in \R, because it is mathematically equivalent to $n$ in the failure-process parameterization. The overdispersion parameter measures the amount of clustering, or aggregation, or heterogeneity, in the data: a smaller $k$ means more heterogeneity.
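We can verify the relationship between the two parameterizations directly, since \R's \code{dnbinom} accepts either \code{prob} or \code{mu} (the parameter values here are arbitrary):
<<>>=
## size=n with prob=p is the same distribution as size=k with mu=n*(1-p)/p
n <- 2; p <- 0.4
dnbinom(0:3, size=n, prob=p)
dnbinom(0:3, size=n, mu=n*(1-p)/p)
@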
The variance of the negative binomial distribution is $\mu+\mu^2/k$, and so as $k$ becomes large the variance approaches the mean and the distribution approaches the Poisson distribution. For $k>10$, the negative binomial is hard to tell from a Poisson distribution, but $k$ is often less than 1 in ecological applications\footnote{Beware of the word ``overdispersion'', which is sometimes used with an opposite meaning in spatial statistics, where it can mean ``more regular than expected from a random distribution of points''. If you took quadrat samples from such an ``overdispersed'' population, the distribution of counts would have variance less than the mean and be ``underdispersed'' in the probability distribution sense \citep{BrownBolker2004} (!)}.

Specifically, you can get a negative binomial distribution as the result of a Poisson sampling process where the rate $\lambda$ itself varies. If the distribution of $\lambda$ is a gamma distribution (p.~\pageref{sec:gamma}) with shape parameter $k$ and mean $\mu$, and $x$ is Poisson-distributed with mean $\lambda$, then the distribution of $x$ will be a negative binomial distribution with mean $\mu$ and overdispersion parameter $k$ \citep{May1978,HilbornMangel1997} (see the simulation sketch at the end of this section). In this case, the negative binomial reflects unmeasured (``random'') variability in the population.

Negative binomial distributions can also result from a homogeneous birth-death process, with births and deaths (and immigrations) occurring at random in continuous time. Samples from a population that starts from 0 at time $t=0$, with immigration rate $i$, birth rate $b$, and death rate $d$, will be negative binomially distributed with parameters $\mu=\frac{i}{b-d}\left(e^{(b-d)t}-1\right)$ and $k=i/b$ \citep[p. 99]{Bailey1964}.

Several different ecological processes can often generate the same probability distribution. We can usually reason forward from knowledge of probable mechanisms operating in the field to plausible distributions for modeling data, but this many-to-one relationship suggests that it is unsafe to reason backwards from probability distributions to particular mechanisms that generate them.

\begin{figure}
<>=
par(lwd=2,cex=1.5,las=1,bty="l",mgp=c(2.5,1,0),yaxs="i")
xvec = 0:10
pvec = c(10,1,0.1)
multdiscplot(xvec,parvec=pvec, dfun="dnbinom", mu=2,parname=c("size"),
   col=gray((0:3)/3),lwd=4, displ=0.1)
##legend(4,0.6,
text(c(3.53,0.37,0.16),
   c(0.18,0.32,0.59),
   c(expression(k==10),
     expression(k==1),
     expression(k==0.1)),
   adj=0, col=gray((0:3)/3),lty=1,lwd=3)
##leg.y = seq(0.6,length=3,by=-0.06)
##segments(rep(4,3),leg.y,rep(6,3),leg.y,
##   col=gray((0:3)/3),lwd=3)
##for (i in 1:3)
##   text(rep(6.4,3),leg.y[i],
##     substitute(list(mu==2,k==z),list(z=pvec[i])),adj=0)
@
\caption{Negative binomial distribution. Mean $\mu$ = 2 in all cases.}
\label{fig:negbinom}
\end{figure}

\emph{Examples:} essentially the same as the Poisson distribution, but allowing for heterogeneity. Numbers of individuals per patch; distributions of numbers of parasites within individual hosts; number of seedlings in a gap, or per unit area, or per seed trap.

% prob=size/(size+mu); size/prob = mu+size; mu = size/prob-size = size*(1/p-1) = size*((1-p)/p)
\textbf{Summary:} \\
\begin{tabular}{ll}
range & discrete, $x \ge 0$\\
distribution & $\frac{(n+x-1)!}{(n-1)! \, x!} p^n (1-p)^x$ \\
 & \textbf{or} $\frac{\Gamma(k+x)}{\Gamma(k) x!} (k/(k+\mu))^k (\mu/(k+\mu))^x$ \\
\R & \code{dnbinom}, \code{pnbinom}, \code{qnbinom}, \code{rnbinom} \\
parameters & $p$ ($0 < p < 1$), probability of success per trial [\code{prob}] \\
 & $n$ (positive integer), number of successes [\code{size}] \\
 & \textbf{or} $\mu$ (real, positive), expected number of counts [\code{mu}] \\
 & $k$ (real, positive), overdispersion parameter [\code{size}] \\
mean & $\mu=n(1-p)/p$ \\
variance & $\mu+\mu^2/k$ \textbf{or} $n(1-p)/p^2$ \\
CV & $\sqrt{(1+\mu/k)/\mu}$ \textbf{or} $1/\sqrt{n(1-p)}$
\end{tabular}
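The simulation sketch promised above (a minimal check, with arbitrary values of $\mu$ and $k$; the \R\ supplement at the end of the chapter works through a fuller version):
<<>>=
## Poisson sampling with a Gamma-distributed rate (shape k, mean mu)
## gives a negative binomial with mean mu and overdispersion k
mu <- 10; k <- 0.9
lambda <- rgamma(10000, shape = k, scale = mu/k)
z <- rpois(10000, lambda)
c(mean(z), mu)          ## sample vs. theoretical mean
c(var(z), mu + mu^2/k)  ## sample vs. theoretical variance
@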
\subsubsection{Geometric}
The geometric distribution is the simplest waiting-time distribution: in the parameterization that \R\ uses, it counts the number of failures, in a series of independent trials with the same probability of success, before the first success occurs. It is the special case of the negative binomial with $n$ (or $k$) equal to 1, and the discrete counterpart of the exponential distribution (below). Its most common ecological use is to describe survival in discrete time when the probability of death is constant in each time step; like the exponential, it has its mode at zero and a long right tail, so a few individuals survive much longer than the typical lifespan.

\begin{figure}
<>=
par(lwd=2,cex=1.5,las=1,bty="l",yaxs="i")
pvec=c(0.2,0.5)
multdiscplot(0:20,parvec=pvec, dfun="dgeom", parname="prob",
   col=gray((0:2)/2),lwd=3, displ=0.1, xlab="Survival time")
text(c(5.4,0.5), c(0.082,0.41),
   paste("p=",pvec,sep=""), adj=0,
   col=gray((0:2)/2))
@
\caption{Geometric distribution.}
\label{fig:geometric}
\end{figure}

\emph{Examples:} number of successful/survived breeding seasons for a seasonally reproducing organism. Lifespans measured in discrete units.

\textbf{Summary:} \\
\begin{tabular}{ll}
range & discrete, $x \ge 0$ \\
distribution & $p(1-p)^x$ \\
\R & \code{dgeom}, \code{pgeom}, \code{qgeom}, \code{rgeom} \\
parameters & $p$ ($0 < p < 1$), probability of ``success'' (e.g. mortality) per trial [\code{prob}] \\
mean & $(1-p)/p$ \\
variance & $(1-p)/p^2$ \\
CV & $1/\sqrt{1-p}$
\end{tabular}

\subsubsection{Beta-binomial}
Just as the negative binomial allows for heterogeneity in Poisson rates, the \emph{beta-binomial} allows for heterogeneity in the per-trial probability of a binomial: instead of being fixed, the probability of success is itself drawn from a Beta distribution (p.~\pageref{sec:beta}). Rather than the Beta shape parameters $a$ and $b$, it is convenient to use the mean per-trial probability $p=a/(a+b)$ and an overdispersion parameter $\theta=a+b$ (\code{prob} and \code{theta} in \code{dbetabinom}, from the \code{emdbook} package). As $\theta$ becomes large the underlying heterogeneity shrinks and the beta-binomial approaches the binomial; small $\theta$ gives more variance than the corresponding binomial.

\begin{figure}
<>=
op <- par(lwd=2,cex=1.5,las=1,bty="l",mgp=c(3,1,0),yaxs="i")
multdiscplot(0:10,parvec=c(0.5,5),parname="theta",
   dfun="dbetabinom",prob=0.5,size=10,
   col=gray((0:3)/3),lwd=5,
   xlab="# of successes",displ=0.1)
par(xpd=NA)
text(0.5,0.25,expression(theta==0.5),col=gray(0),pos=4)
text(4,0.15, expression(theta==5),col=gray(1/3),pos=4)
#legend(2,0.5,paste("N=10, p=",c(0.1,0.5,0.9),sep=""),
#   lty=1,lwd=5,col=gray((0:3)/3))
par(xpd=TRUE)
par(op)
@
\caption{Beta-binomial distribution. Number of trials ($N$) equals 10, average per-trial probability ($p$) equals 0.5 for all distributions.}
\label{fig:betabinom}
\end{figure}

\emph{Examples:} as for the binomial.

\subsection{Continuous distributions}
%% You can see Phil Pollett's Java applets that illustrate these distributions,
%% and dynamically show you the effects of changing the parameters, at
%% \url{http://www.maths.uq.edu.au/~pkp/teaching/PlotDist/}.

\subsubsection{Uniform distribution}
The uniform distribution with limits $a$ and $b$, denoted $U(a,b)$, has a constant probability density of $1/(b-a)$ for $a \le x \le b$ and zero probability elsewhere. The standard uniform, $U(0,1)$, is very commonly used as a building block for other distributions, but is surprisingly rarely used in ecology otherwise.

\textbf{Summary:} \\
\begin{tabular}{ll}
range & $a \le x \le b$ \\
distribution & $1/(b-a)$ \\
\R & \code{dunif}, \code{punif}, \code{qunif}, \code{runif} \\
parameters & minimum ($a$) and maximum ($b$) limits (real) [\code{min}, \code{max}]\\
mean & $(a+b)/2$ \\
variance & $(b-a)^2/12$ \\
CV & $(b-a)/((a+b) \sqrt{3})$
\end{tabular}

\begin{figure}
<>=
op <- par(lwd=2,cex=1.5,las=1,bty="l",yaxs="i")
par(mar=c(5,4,2,6))
curve(dunif(x,0,1),from=-0.3,to=2.3,
   xlab="Value",ylab="Probability density",
   axes=FALSE,type="S")
axis(side=2)
axis(side=1)
curve(dunif(x,0.5,2),add=TRUE,lty=2,type="S")
par(xpd=NA)
text(c(0.5,1.6),
   c(1.05,1/1.5+0.05),
   c("U(0,1)","U(0.5,2)"))
par(xpd=FALSE)
box()
par(op)
@
\caption{Uniform distribution.}
\label{fig:unif}
\end{figure}

\subsubsection{Normal distribution}
Normally distributed variables are everywhere, and most classical statistical methods use this distribution.
The explanation for the normal distribution's ubiquity is the \emph{Central Limit Theorem}, which says that if you add a large number of independent samples from the same distribution, the distribution of the sum will be approximately normal. ``Large'', for practical purposes, can mean as few as 5. The central limit theorem does \emph{not} mean that ``all samples with large numbers are normal''. One obvious counterexample is two different populations with different means that are lumped together, leading to a distribution with two peaks (p.~\pageref{sec:discmix}). Also, adding isn't the only way to combine samples: if you multiply independent samples from the same distribution, you get a lognormal distribution instead of a normal distribution (p.~\pageref{sec:lnorm}).

Many distributions (binomial, Poisson, negative binomial, gamma) become approximately normal in some limit (Figure~\ref{fig:probdisttab}). You can usually think about this as some form of ``adding lots of things together''.

Because the normal distribution specifies the mean and variance with two separate parameters, models based on it typically assume that the variance stays constant as the mean changes, in contrast to the Poisson and binomial distributions, where the variance is a fixed function of the mean.
% Whenever you can, it's easiest to explain processes that match the
% definitions of the Poisson or binomial with the correct distribution.
% This takes care of the systematic changes of the variance in a natural
% way.
% However, you can also define models with
% \emph{heteroscedasticity}---variance that is not a function of the
% mean.
% For example, $Y \sim a X + b + N(0, c X^d)$ specifies a linear
% regression (the expected value of $Y$ is a linear function of $X$)
% where the variance, instead of being constant, is itself a function of
% the dependent variable.

\begin{figure}
<>=
op <- par(lwd=2,cex=1.5,las=1,bty="l",yaxs="i")
par(mar=c(5,4,2,6))
curve(dnorm(x,0,1),from=-10,to=12,
   xlab="Value",ylab="Probability density",xlim=c(-14,10),
   axes=FALSE)
col2 = gray(0.7)
axis(side=2)
axis(side=1,at=seq(-10,10,by=5))
curve(dnorm(x,0,3),add=TRUE,col=1,lty=2)
curve(dnorm(x,2,1),add=TRUE,col=col2,lty=1)
curve(dnorm(x,2,3),add=TRUE,col=col2,lty=2)
par(xpd=NA)
#legend(5,0.4,
text(c(-1,-3),
   c(0.3,0.1),
   adj=1,
   c(expression(list(mu==0,sigma==1)),
     expression(list(mu==0,sigma==3))))
text(c(4,6),
   c(0.3,0.1),
   adj=0,
   c(expression(list(mu==2,sigma==1)),
     expression(list(mu==2,sigma==3))),
   col=col2)
par(op)
@
\caption{Normal distribution}
\label{fig:normal}
\end{figure}

\emph{Examples:} practically everything.

\textbf{Summary:} \\
\begin{tabular}{ll}
range & all real values \\
distribution & $\frac{1}{\sqrt{2 \pi} \sigma} \exp\left(-\frac{(x-\mu)^2}{2 \sigma^2}\right)$ \\
\R & \code{dnorm}, \code{pnorm}, \code{qnorm}, \code{rnorm} \\
parameters & $\mu$ (real), mean [\code{mean}] \\
 & $\sigma$ (real, positive), standard deviation [\code{sd}] \\
mean & $\mu$ \\
variance & $\sigma^2$ \\
CV & $\sigma/\mu$ \\
Conjugate prior & Normal ($\mu$); Gamma ($1/\sigma^2$)
\end{tabular}

\subsubsection{Gamma}
\label{sec:gamma}
The \emph{Gamma} distribution is the distribution of \emph{waiting times} until a certain number of events take place. For example, $\mbox{Gamma}(\mbox{shape}=3,\mbox{scale}=2)$ is the distribution of the length of time (in days) you'd expect to have to wait for 3 deaths in a population, given that the average survival time is 2~days (mortality rate is 1/2 per day). The mean waiting time is 6~days (3 deaths divided by 1/2 death per day).
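As a quick simulation check of this waiting-time interpretation (a minimal sketch, using the same parameter values):
<<>>=
## mean of Gamma(shape=3, scale=2): waiting time for 3 events at
## 2 days per event, so approximately 3*2 = 6 days
mean(rgamma(10000, shape = 3, scale = 2))
@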
(While the gamma \emph{function} (\code{gamma} in \R: see Appendix) is usually written with a capital Greek gamma, $\Gamma$, the Gamma distribution (\code{dgamma} in \R) is written out as Gamma.) Gamma distributions with integer shape parameters are also called \emph{Erlang} distributions. The Gamma distribution is still defined for non-integer (positive) shape parameters, but the simple description given above breaks down: how can you define the waiting time until 3.2 events take place? For shape parameters $\le 1$, the Gamma has its mode at zero; for shape parameter $=1$, the Gamma is equivalent to the exponential (see below). For shape parameter greater than 1, the Gamma has a peak (mode) at a value greater than zero; as the shape parameter increases, the Gamma distribution becomes more symmetrical and approaches the normal distribution. This behavior makes sense if you think of the Gamma as the distribution of the sum of independent, identically distributed waiting times, in which case it is governed by the Central Limit Theorem. The scale parameter (sometimes defined in terms of a rate parameter instead, $1/\mbox{scale}$) just adjusts the mean of the Gamma by adjusting the waiting time per event; however, multiplying the waiting time by a constant to adjust its mean also changes the variance, so both the variance and the mean depend on the scale parameter. The Gamma distribution is less familiar than the normal, and new users of the Gamma often find it annoying that in the standard parameterization you can't adjust the mean independently of the variance. You could define a new set of parameters $m$ (mean) and $v$ (variance), with $\mbox{scale}=v/m$ and $\mbox{shape}=m^2/v$ --- but then you would find (unlike the normal distribution) the shape changing as you changed the variance. Nevertheless, the Gamma is extremely useful; it solves the problem that many researchers face when they have a continuous variable with ``too much variance'', whose coefficient of variation is greater than about 0.5. Modeling such data with a normal distribution leads to unrealistic negative values, which then have to be dealt with in some \emph{ad hoc} way like truncating them or otherwise trying to ignore them. The Gamma is often a more realistic alternative. The Gamma is the continuous counterpart of the negative binomial, which is the discrete distribution of a number of trials (rather than length of time) until a certain number of events occur. Both the negative binomial and Gamma distributions are often generalized, however, in ways that don't necessarily make sense according to their simple mechanistic descriptions (e.g. a Gamma distribution with a shape parameter of 2.3 corresponds to the distribution of waiting times until 2.3 events occur \ldots). The Gamma and negative binomial are both commonly used phenomenologically, as skewed or overdispersed versions of the Poisson or normal distributions, rather than for their mechanistic descriptions. The Gamma is less widely used than the negative binomial because the negative binomial replaces the Poisson, which is restricted to a particular variance, while the Gamma replaces the normal, which can have any variance. Thus you might use the negative binomial for any discrete distribution with variance $>$ mean, while you wouldn't need a Gamma distribution unless the distribution you were trying to match was skewed to the right. \begin{figure} <>= op <- par(lwd=2,cex=1.5,las=1,bty="l") curve(dgamma(x,1,1),from=1e-4,to=25, xlab="Value",ylab="Prob. 
density") cols = c("black","gray") curve(dgamma(x,2,1),add=TRUE,col=cols[1],lty=2,from=0) curve(dgamma(x,5,1),add=TRUE,col=cols[1],lty=3,from=0) curve(dgamma(x,1,1/3),add=TRUE,col=cols[2],lty=1,from=0) curve(dgamma(x,2,1/3),add=TRUE,col=cols[2],lty=2,from=0) curve(dgamma(x,5,1/3),add=TRUE,col=cols[2],lty=3,from=0) par(xpd=NA) legend(5,1, paste("shape=",c(1,2,5,1,2,5),", scale=", rep(c("1","1/3"),c(3,3)),sep=""), col=rep(cols,each=3),lty=rep(1:3,2)) par(op) @ \caption{Gamma distribution} \label{fig:gamma} \end{figure} \textbf{Summary:} \\ \begin{tabular}{ll} range & positive real values \\ \R & \code{dgamma}, \code{pgamma}, \code{qgamma}, \code{rgamma} \\ distribution & $\frac{1}{s^a \Gamma(a)} x^{a-1} e^{-x/s}$ \\ parameters & $s$ (real, positive), scale: length per event [\code{scale}] \\ & \textbf{or} $r$ (real, positive), rate = $1/s$; rate at which events occur [\code{rate}] \\ & $a$ (real, positive), shape: number of events [\code{shape}] \\ mean & $as$ or $a/r$ \\ variance & $as^2$ or $a/r^2$ \\ CV & $1/\sqrt{a}$ \end{tabular} \emph{Examples:} almost any environmental variable with a large variance where negative values don't make sense: nitrogen concentrations, light intensity, etc.. \subsubsection{Exponential} The exponential distribution (Figure~\ref{fig:expdist}) describes the distribution of waiting times for a single event to happen, given that there is a constant probability per unit time that it will happen. It is the continuous counterpart of the geometric distribution and a special case (for shape parameter=1) of the Gamma distribution. It can be useful both mechanistically, as a distribution of inter-event times or lifetimes, or phenomenologically, for any continuous distribution that has highest probability for zero or small values. \begin{figure} <>= par(lwd=2,cex=1.5,las=1,bty="l",yaxs="i",xaxs="i",mgp=c(2.5,1,0)) curve(dexp(x,1),from=1e-4,to=15, xlab="Value",ylab="Probability density") curve(dexp(x,0.5),from=1e-4,add=TRUE,lty=2) curve(dexp(x,0.2),from=1e-4,add=TRUE,lty=3) text(c(0.73,2.17,5.7), c(0.67,0.2,0.11), c(expression(lambda==1), expression(lambda==1/2), expression(lambda==1/5)), col=1,adj=0) @ \caption{Exponential distribution.} \label{fig:expdist} \end{figure} \emph{Examples:} times between events (bird sightings, rainfall, etc.); lifespans/survival times; random samples of anything that decreases exponentially (e.g. light levels in a forest canopy). \textbf{Summary:} \\ \begin{tabular}{ll} range & positive real values \\ \R & \code{dexp}, \code{pexp}, \code{qexp}, \code{rexp} \\ density & $\lambda e^{-\lambda x}$ \\ parameters & $\lambda$ (real, positive), rate: death/disappearance rate [\code{rate}] \\ mean & $1/\lambda$ \\ variance & $1/\lambda^2$ \\ CV & 1 \end{tabular} \subsubsection{Beta} \label{sec:beta} The beta distribution, a continuous distribution closely related to the binomial distribution, completes our basic family of continuous distributions (Figure~\ref{fig:probdisttab}). The beta distribution is the only standard continuous distribution (besides the uniform distribution) with a finite range, from 0 to 1. The beta distribution is the inferred distribution of the \emph{probability} of success in a binomial trial with $a-1$ observed successes and $b-1$ observed failures. When $a=b$ the distribution is symmetric around $x=0.5$, when $ab$ it shifts toward 1. With $a=b=1$, the distribution is $U(0,1)$. As $a+b$ (equivalent to the total number of trials+2) gets larger, the distribution becomes more peaked. 
For $a$ or $b$ less than 1, the mechanistic description stops making sense (how can you have fewer than zero trials?), but the distribution is still well-defined, and when $a$ and $b$ are both between 0 and 1 it becomes U-shaped --- it has peaks at $p=0$ and $p=1$.

The beta distribution is obviously good for modeling probabilities or proportions. It can also be useful for modeling continuous distributions with peaks at both ends, although in some cases a finite mixture model (p.~\pageref{sec:discmix}) may be more appropriate. The beta distribution is also useful whenever you have to define a continuous distribution on a finite range, as it is the only such standard continuous distribution. It's easy to rescale the distribution so that it applies over some other finite range instead of from 0 to 1: for example, \cite{Tiwari+2005} used the beta distribution to describe the distribution of turtles on a beach, so the range would extend from 0 to the length of the beach.

\begin{figure}
<>=
op <- par(lwd=2,cex=1.5,las=1,bty="l",yaxs="i",mgp=c(2.5,1,0),
   mar=c(5,4,2,5)+0.1)
curve(dbeta(x,1,1),from=0,to=1,
   xlab="Value",ylab="Probability density",lty=1,
   ylim=c(0,5))
curve(dbeta(x,5,5),add=TRUE,from=0,to=1,col=1,lty=2)
curve(dbeta(x,5,1),add=TRUE,from=0,to=1,col=1,lty=3)
curve(dbeta(x,1,5),add=TRUE,from=0,to=1,col=1,lty=4)
curve(dbeta(x,0.5,0.5),add=TRUE,from=0,to=1,col=1,lty=5)
par(xpd=NA)
text(0.07,4.15,c("a=1, b=5"),adj=0)
text(0.5,2.7,c("a=5, b=5"))
text(c(1,1,0.94),
   c(4.58,2.53,0.83),
   c("a=5, b=1",
     "a=0.5, b=0.5",
     "a=1, b=1"),
   adj=0)
par(xpd=FALSE)
par(op)
@
\caption{Beta distribution}
\label{fig:beta}
\end{figure}

\textbf{Summary:} \\
\begin{tabular}{ll}
range & real, 0 to 1 \\
\R & \code{dbeta}, \code{pbeta}, \code{qbeta}, \code{rbeta} \\
density & $\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1}(1-x)^{b-1}$ \\
parameters & $a$ (real, positive), shape 1: number of successes $+1$ [\code{shape1}] \\
 & $b$ (real, positive), shape 2: number of failures $+1$ [\code{shape2}] \\
mean & $a/(a+b)$ \\
mode & $(a-1)/(a+b-2)$ \\
variance & $ab/\left((a+b)^2 (a+b+1)\right)$ \\
CV & $\sqrt{(b/a)/(a+b+1)}$ \\
\end{tabular}

\subsubsection{Lognormal}
\label{sec:lnorm}
The lognormal falls outside the neat classification scheme we've been building so far; it is not the continuous analogue or limit of some discrete sampling distribution (Figure~\ref{fig:probdisttab})\footnote{The lognormal extends our table in another direction --- exponential transformation of a known distribution. Other distributions have this property, most notably the \emph{extreme value distribution}, which is the log-exponential: if $Y$ is exponentially distributed, then $\log Y$ is extreme-value distributed. As its name suggests, the extreme value distribution occurs mechanistically as the distribution of extreme values (e.g. maxima) of samples of other distributions \citep{Katz+2005}.}. Its mechanistic justification is like the normal distribution (the Central Limit Theorem), but for the \emph{product} of many independent, identical variates rather than their sum. Just as taking logarithms converts products into sums, taking the logarithm of a lognormally distributed variable---which might result from the product of independent variables---converts it into a normally distributed variable resulting from the sum of the logarithms of those independent variables.
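Here is a minimal simulation sketch of this multiplicative mechanism, with arbitrary bounds for the random factors: the logarithm of a product of many independent positive factors should look approximately normal.
<<>>=
## multiply 20 independent random growth increments; the log of
## the product is a sum of logs, so it should be roughly normal
z <- replicate(1000, prod(runif(20, min = 0.9, max = 1.1)))
hist(log(z), freq = FALSE, breaks = 30)
curve(dnorm(x, mean = mean(log(z)), sd = sd(log(z))), add = TRUE)
@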
% (You might consider the name of the lognormal to be backwards,
% since you have to exponentiate rather than take
% the logarithm of a normally
% distributed variable to get one that was lognormally distributed \ldots)
The best example of this mechanism is the distribution of the sizes of individuals or populations that grow exponentially, with a per capita growth rate that varies randomly over time. At each time step (daily, yearly, etc.), the current size is \emph{multiplied} by the randomly chosen growth increment, so the final size (when measured) is the product of the initial size and all of the random growth increments.

One potentially puzzling aspect of the lognormal distribution is that its mean is not what you might naively expect if you exponentiate a normal distribution with mean $\mu$ (i.e. $e^\mu$). Because of Jensen's inequality, and because the exponential function is an accelerating function, the mean of the lognormal, $e^{\mu+\sigma^2/2}$, is greater than $e^\mu$ by an amount that depends on the variance of the original normal distribution. When the variance is small relative to the mean, the mean is approximately equal to $e^\mu$, and the lognormal itself looks approximately normal (e.g. solid lines in Figure~\ref{fig:lognormal}, with $\sigma_{\log} = 0.2$). As with the Gamma distribution, the distribution also changes shape as the variance increases, becoming more skewed.

The lognormal is also used phenomenologically in some of the same situations where a Gamma distribution also fits: continuous, positive distributions with long tails or variance much greater than the mean \citep{McGill+2006}. Like the distinction between a Michaelis-Menten and a saturating exponential, you may not be able to tell the difference between a lognormal and a Gamma without large amounts of data. Use the one that is more convenient, or that corresponds to a more plausible mechanism for your data.

\begin{figure}
<>=
par(lwd=2,cex=1.5,las=1,bty="l")
curve(dlnorm(x,0,0.5),from=0,to=12,
   xlab="Value",ylab="Probability density",n=300,lty=2,
   ylim=c(0,2))
curve(dlnorm(x,0,1),add=TRUE,col=1,lty=3,n=300,from=0)
curve(dlnorm(x,0,0.2),add=TRUE,col=1,lty=1,n=500,from=0)
curve(dlnorm(x,2,0.5),add=TRUE,col=cols[2],lty=2,from=0)
curve(dlnorm(x,2,1),add=TRUE,col=cols[2],lty=3,from=0)
curve(dlnorm(x,2,0.2),add=TRUE,col=cols[2],lty=1,from=0)
legend(3,2,
   c(expression(paste(mu[log]==0,", ",sigma[log]==0.2)),
     expression(paste(mu[log]==0,", ",sigma[log]==0.5)),
     expression(paste(mu[log]==0,", ",sigma[log]==1)),
     expression(paste(mu[log]==2,", ",sigma[log]==0.2)),
     expression(paste(mu[log]==2,", ",sigma[log]==0.5)),
     expression(paste(mu[log]==2,", ",sigma[log]==1))),
   col=rep(cols,each=3),lty=rep(1:3,2))
@
\caption{Lognormal distribution}
\label{fig:lognormal}
\end{figure}

\emph{Examples:} sizes or masses of individuals, especially rapidly growing individuals; abundance vs. frequency curves for plant communities.
\textbf{Summary:} \\
\begin{tabular}{ll}
range & positive real values \\
\R & \code{dlnorm}, \code{plnorm}, \code{qlnorm}, \code{rlnorm} \\
density & $\frac{1}{\sqrt{2 \pi} \sigma x} e^{-(\log x - \mu)^2/(2 \sigma^2)}$ \\
parameters & $\mu$ (real): mean of the logarithm [\code{meanlog}] \\
 & $\sigma$ (real): standard deviation of the logarithm [\code{sdlog}] \\
mean & $\exp(\mu+\sigma^2/2)$ \\
variance & $\exp(2 \mu + \sigma^2) (\exp(\sigma^2)-1)$ \\
CV & $\sqrt{\exp(\sigma^2)-1}$ ($\approx \sigma$ when $\sigma<1/2$)
\end{tabular}
\label{lnormsum}

\begin{figure}
\begin{center}
\includegraphics{probdisttab}
\end{center}
\caption{Relationships among probability distributions.}
\label{fig:probdisttab}
\end{figure}

\section{Extending simple distributions; compounding and generalizing}
What do you do when none of these simple distributions fits your data? You could always explore other distributions. For example, the Weibull distribution (similar to the Gamma distribution in shape: \code{?dweibull} in \R) generalizes the exponential to allow for survival probabilities that increase or decrease with age (p.~\pageref{eq:weib}). The Cauchy distribution (\code{?dcauchy} in \R), described as \emph{fat-tailed} because the probability of extreme events (in the tails of the distribution) is very large --- larger than for the exponential or normal distributions --- can be useful for modeling distributions with many outliers. You can often find useful distributions for your data in modeling papers from your subfield of ecology. However, in addition to simply learning more distributions, it can also be useful to learn some strategies for generalizing more familiar distributions.
% \begin{quote}
% \textbf{Digression:}
% \emph{The two basic assumptions of ``randomness'', for example in
% the Poisson distribution, are (1) independence and (2) uniformity.
% Individuals are assumed not to affect each other, and the
% \emph{expected} arrival rate/population density is assumed to be the
% same everywhere. Neither of these assumptions is ever really true.
% When we relax one or the other we get distributions such as the
% negative binomial, or more complicated distributions. You can tell
% from data whether at least one of these assumptions is violated
% (e.g. if a data set has variance $>$ mean), but not which one --- you
% can't infer the process from the pattern. But you can construct
% plausible models that explain the pattern in terms of one process or
% another.}
% \end{quote}

\subsection{Adding covariates}
One obvious strategy is to look for systematic differences within your data that explain the non-standard shape of the distribution. For example, a \emph{bimodal} or \emph{multimodal} distribution (one with two or more peaks, in contrast to most of the distributions discussed above that have a single peak) may make perfect sense once you realize that your data are a collection of objects from different populations with different means.
% [CITE Holling?]
For example, the sizes or masses of sexually dimorphic animals or of animals from several different cryptic species would have bimodal or multimodal distributions, respectively. A distribution that isn't multimodal but is more fat-tailed than a normal distribution might indicate systematic variation in a continuous covariate, such as nutrient availability, maternal size, or environmental temperature, among different individuals.

\subsection{Mixture models}
But what if you can't identify systematic differences?
You can still extend standard distributions by supposing that your data are really a mixture of observations from different types of individuals, but that you can't observe the (finite) types or (continuous) covariates of individuals. These distributions are called \emph{mixture distributions} or \emph{mixture models}. Fitting them to data can be challenging, but they are very flexible.

\subsubsection{Finite mixtures}
\label{sec:discmix}
Finite mixture models suppose that your observations are drawn from a discrete set of unobserved categories, each of which has its own distribution: typically all categories have the same type of distribution, such as normal, but with different mean or variance parameters. Finite mixture distributions often fit multimodal data. Finite mixtures are typically parameterized by the parameters of each component of the mixture, plus a set of probabilities or percentages describing the amount of each component. For example, 30\% of the organisms ($p=0.3$) could be in group~1, normally distributed with mean $1$ and standard deviation $2$, while 70\% ($1-p=0.7$) are in group~2, normally distributed with mean $5$ and standard deviation $1$ (Figure~\ref{fig:discrmix}). If the peaks of the distributions are closer together, or their standard deviations are larger so that the distributions overlap, you'll see a broad (and perhaps lumpy) peak rather than two distinct peaks.

\begin{figure}
<>=
par(lwd=2,cex=1.5,las=1,bty="l")
curve(0.3*dnorm(x,1,2)+0.7*dnorm(x,5,1),from=-8,to=12,
   ylab="Probability density",xlab="")
mtext(side=1,"x",line=2,cex=1.5)
curve(0.3*dnorm(x,1,2),lty=2,add=TRUE)
curve(0.7*dnorm(x,5,1),lty=2,add=TRUE)
@
\caption{Finite mixture distribution: 30\% $\mbox{Normal}(\mu=1,\sigma=2)$, 70\% $\mbox{Normal}(\mu=5,\sigma=1)$.}
\label{fig:discrmix}
\end{figure}

\label{zinfl}
\emph{Zero-inflated} models are a common type of finite mixture model \citep{Inouye1999,Martin+2005}. Zero-inflated models (Figure~\ref{fig:zi}) combine a standard discrete probability distribution (e.g. binomial, Poisson, or negative binomial), which typically includes some probability of sampling zero counts even when some individuals are present, with some additional process that can also lead to a zero count (e.g. complete absence of the species or trap failure).

\subsection{Continuous mixtures}
\label{sec:compounding}
Continuous mixture distributions, also known as \emph{compounded distributions}, allow the parameters themselves to vary randomly, drawn from their own distribution. They are a sensible choice for overdispersed data, or for data where you suspect that unobserved covariates may be important. Technically, compounded distributions are the distribution of a sampling distribution $S(x,p)$ with parameter(s) $p$ that vary according to another (typically continuous) distribution $P(p)$: the compounded distribution is $C(x) = \int S(x,p) P(p) \, dp$. For example, compounding a Poisson distribution by drawing its rate parameter from a Gamma distribution with shape parameter $k$ and mean $\mu$ (i.e. scale parameter $\mu/k$) results in a negative binomial distribution with mean $\mu$ and overdispersion parameter $k$ (p.~\pageref{sec:negbinom}). Continuous mixture distributions are growing ever more popular in ecology as ecologists try to account for heterogeneity in their data. The negative binomial, which could also be called the Gamma-Poisson distribution to highlight its compound origin, is the most common compounded distribution.
The Beta-binomial is also fairly common: like the negative binomial, it compounds a common discrete distribution (binomial) with its conjugate prior (Beta), resulting in a mathematically simple form that allows for more variability. The \emph{lognormal-Poisson} is very similar to the negative binomial, except that (as its name suggests) it uses the lognormal instead of the Gamma as a compounding distribution. One technical reason to use the less common lognormal-Poisson is that on the log scale the rate parameter is normally distributed, which simplifies some numerical procedures \citep{Elston+2001}.

\citet{Clark+1999} used the \emph{Student $t$} distribution to model seed dispersal curves. Seeds often disperse fairly uniformly near parental trees but also have a high probability of long-distance dispersal. These two characteristics are incompatible with standard seed dispersal models like the exponential and normal distributions. Clark \emph{et al.} assumed that the seed dispersal curve represents a compounding of a normal distribution for the dispersal of any one seed with a Gamma distribution of the inverse variance of the distribution of any particular seed (i.e., $1/\sigma^2 \sim \mbox{Gamma}$)\footnote{This choice of a compounding distribution, which may seem arbitrary, turns out to be mathematically convenient.}. This variation in variance accounts for the different distances that different seeds may travel as a function of factors like their size, shape, height on the tree, and the wind speed at the time they are released. Clark \emph{et al.} used compounding to model these factors as random, unobserved covariates, since they are practically impossible to measure for all the individual seeds on a tree or in a forest. The inverse Gamma-normal model is equivalent to the Student $t$ distribution, which you may recognize from $t$ tests in classical statistics and which statisticians sometimes use as a phenomenological model for fat-tailed distributions. Clark \emph{et al.} extended the usual one-dimensional $t$ distribution (\code{?dt} in \R) to the two-dimensional distribution of seeds around a parent and called it the $2Dt$ distribution. The $2Dt$ distribution has a scale parameter that determines the mean dispersal distance and a shape parameter $p$. When $p$ is large the underlying Gamma distribution has a small coefficient of variation and the $2Dt$ distribution is close to normal; when $p=1$ the $2Dt$ becomes a Cauchy distribution.

\emph{Generalized distributions} are an alternative class of mixture distribution that arises when there is a sampling distribution $S(x)$ for the number of individuals within a cluster and another sampling distribution $C(x)$ for the number of clusters in a sampling unit. For example, the distribution of the number of eggs per square might be generalized from the distribution of clutches per square and of eggs per clutch. A standard example is the ``Poisson-Poisson'' or ``Neyman Type A'' distribution \citep{Pielou1977}, which assumes a Poisson distribution of clusters with a Poisson distribution of individuals in each.

Figuring out the probability distribution or density formulas for compounded distributions analytically is mathematically challenging (see \cite{Bailey1964} or \cite{Pielou1977} for the gory details), but \R\ can easily generate random numbers from these distributions (see the \R\ supplement for more detail). The key is that \R's functions for generating random deviates (\code{rpois}, \code{rbinom}, etc.) can take vectors for their parameters.
Rather than generate (say) 20 deviates from a binomial distribution with $N$ trials and a fixed per-trial probability $p$, you can choose 20 deviates with $N$ trials and a vector of 20 different per-trial probabilities $p_1$ to $p_{20}$. Furthermore, you can generate this vector of parameters from another randomizing function! For example, to generate 20 beta-binomial deviates with $N=10$ and the per-trial probabilities drawn from a beta distribution with $a=2$ and $b=1$, you could use \code{rbinom(20,size=10,prob=rbeta(20,2,1))}.

Compounding and generalizing are powerful ways to extend the range of stochastic ecological models. A good fit to a compounded distribution also suggests that environmental variation is shaping the variation in the population. But be careful: \cite{Pielou1977} demonstrates that for Poisson distributions, every generalized distribution (corresponding to variation in the underlying density) can also be generated by a compound distribution (corresponding to individuals occurring in clusters), and concludes that (p. 123) ``the fitting of theoretical frequency distributions to observational data can never by itself suffice to `explain' the pattern of a natural population''.

\newpage
\section{\R\ supplement}
For all of the probability distributions discussed in this chapter (and many more: try \code{help.search("distribution")}), \R\ can generate random numbers drawn from the distribution (``deviates''); compute the cumulative distribution function and the probability distribution function; and compute the quantile function, which gives the $x$ value such that $\int_0^x P(y) \, dy$ (the area under the curve from 0 to $x$) is a specified value. For example, you can obtain the critical values of the standard normal distribution, $\pm 1.96$, with \code{qnorm(0.025)} and \code{qnorm(0.975)} (Figure~\ref{fig:dpqr}).
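For a concrete illustration of the four function types (a minimal sketch using the standard normal):
<<>>=
## d = density, p = cumulative probability, q = quantile, r = random
dnorm(1)       ## density at x = 1
pnorm(1.96)    ## P(X <= 1.96), approximately 0.975
qnorm(0.975)   ## x such that P(X <= x) = 0.975, approximately 1.96
rnorm(3)       ## three random standard normal deviates
@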
\begin{figure}
<>=
sh = 4
op=par(mfrow=c(1,2))
par(mar=c(5.1,4.5,1,0.5),cex=1.5,lwd=2,las=1,bty="l",yaxs="i")
curve(dgamma(x,shape=sh),from=0,to=20,ylab="",xlab="",lwd=2,axes=FALSE)
mtext(side=1,line=2,cex=2,"x")
axis(side=1,labels=FALSE)
axis(side=2,labels=FALSE)
box()
xvec = seq(0,5,length=100)
polygon(c(xvec,rev(xvec)),c(rep(0,length(xvec)),dgamma(rev(xvec),shape=sh)),
   col="gray",border=NA)
curve(dgamma(x,shape=sh),from=0,to=20,lwd=2,add=TRUE) ## redraw
abline(v=5,lty=3)
mtext(side=1,line=0.7,at=5,expression(x[0]),cex=1.5)
abline(h=dgamma(5,shape=sh),lty=2,lwd=2)
mtext(side=2,at=dgamma(5,shape=sh),las=1,expression(ddist(x[0])),
   line=1,cex=1.5)
text(10,0.1,expression(pdist(x[0])),cex=1)
arrows(6.5,0.1,3,0.1,angle=20,lwd=2)
mtext(side=2,at=0.07,"Probability\ndensity",cex=2,line=1,
   las=0)
set.seed(1001)
r1 <- rgamma(10,shape=sh)
points(r1,rep(0.01,10),cex=1.5)
text(11.7,0.03,"rdist(10)",adj=0,cex=1)
arrows(11,0.03,7,0.02,lwd=2,angle=20)
curve(pgamma(x,shape=sh),from=0,to=20,xlab="",ylab="",lwd=2,axes=FALSE)
axis(side=1,labels=FALSE)
axis(side=2,labels=FALSE)
mtext(side=1,line=2,cex=2,"x")
box()
abline(v=5,lty=3)
mtext(side=1,line=0.3,at=5,expression(x[0]),cex=1.5)
abline(h=pgamma(5,shape=sh),lty=2,lwd=2)
mtext(side=2,at=pgamma(5,shape=sh),las=1,expression(pdist(x[0])),
   line=1,cex=1.5)
abline(v=qgamma(0.95,shape=sh),lty=4,lwd=2)
mtext(side=2,at=0.95,las=1,0.95,
   line=par("mgp")[2],cex=1.5)
segments(par("usr")[1],0.95,qgamma(0.95,shape=sh),0.95,lty=4,lwd=2)
mtext(side=1,at=qgamma(0.95,shape=sh),text="qdist(0.95)",line=1,
   cex=1.5,adj=0.3)
mtext(side=2,at=0.4,adj=0.5,"Cumulative\ndistribution",cex=2,line=1,las=0)
@
\caption{\R\ functions for an arbitrary distribution \code{dist}, showing density function (\code{ddist}), cumulative distribution function (\code{pdist}), quantile function (\code{qdist}), and random-deviate function (\code{rdist}).}
\label{fig:dpqr}
\end{figure}

\subsection{Discrete distribution}
For example, let's explore the (discrete) negative binomial distribution. First set the random-number seed for consistency:
<<>>=
set.seed(1001)
@
Arbitrarily choose parameters $\mu=10$ and $k=0.9$ --- since $k<1$, this represents a strongly overdispersed population. Remember that \R\ uses \code{size} to denote $k$, because $k$ is mathematically equivalent to the number of successes in the failure-process parameterization.
<<>>=
z <- rnbinom(1000,mu=10,size=0.9)
@
Check the first few values:
<<>>=
head(z)
@
Since the negative binomial has no set upper limit, we will just plot the results up to the maximum value sampled:
<<>>=
maxz <- max(z)
@
The easiest way to plot the results is:
<<>>=
f <- factor(z,levels=0:maxz)
plot(f)
@
using the \code{levels} specification to make sure that all values up to the maximum are included in the plot even when none were sampled in this particular experiment.
If we want the observed probabilities (freq/N) rather than the frequencies:
<<>>=
obsprobs <- table(f)/1000
plot(obsprobs)
@
Add theoretical values:
<<>>=
tvals <- dnbinom(0:maxz,size=0.9,mu=10)
points(0:maxz,tvals)
@
You could plot the deviations with \code{plot(0:maxz,obsprobs-tvals)}; this gives you some idea how the variability changes with the mean.
Find the probability that $x>30$:
<<>>=
pnbinom(30,size=0.9,mu=10,lower.tail=FALSE)
@
By default \R's distribution functions will give you the \emph{lower tail} of the distribution --- the probability that $x$ is less than or equal to some particular value.
You could use \code{1-pnbinom(30,size=0.9,mu=10)} to get the upper tail, since $\prob(x>30)=1-\prob(x\le 30)$, but using \code{lower.tail=FALSE} to get the upper tail is more numerically accurate.

What is the upper 95th percentile of the distribution?
<<>>=
qnbinom(0.95,size=0.9,mu=10)
@
To get the lower and upper 95\% confidence limits, you need
<<>>=
qnbinom(c(0.025,0.975),size=0.9,mu=10)
@
You can also use the random sample \code{z} to check that the mean, variance, and 95th quantile of the sample agree reasonably well with the theoretical expectations:
<<>>=
mu <- 10
k <- 0.9
c(mu,mean(z))
c(mu*(1+mu/k),var(z))
c(qnbinom(0.95,size=k,mu=mu),quantile(z,0.95))
@

\subsection{Continuous distribution: lognormal}
Going through the same exercise for the lognormal, a continuous distribution:
<<>>=
z <- rlnorm(1000,meanlog=2,sdlog=1)
@
Plot the results:
<<>>=
hist(z,breaks=100,freq=FALSE)
lines(density(z,from=0),lwd=2)
@
Add theoretical values:
<<>>=
curve(dlnorm(x,meanlog=2,sdlog=1),add=TRUE,lwd=2,from=0,
   col="darkgray")
@
The probability that $x>30$, and the 95\% confidence limits:
<<>>=
plnorm(30,meanlog=2,sdlog=1,lower.tail=FALSE)
qlnorm(c(0.025,0.975),meanlog=2,sdlog=1)
@
Comparing the theoretical values given on p.~\pageref{lnormsum} with the observed values for this random sample:
<<>>=
meanlog <- 2
sdlog <- 1
c(exp(meanlog+sdlog^2/2),mean(z))
c(exp(2*meanlog+sdlog^2)*(exp(sdlog^2)-1),var(z))
c(qlnorm(0.95,meanlog=meanlog,sdlog=sdlog),quantile(z,0.95))
@
There is a fairly large difference between the expected and observed variance. This is typical: the variances of random samples deviate more from their theoretical expected values than the means of random samples do.
Sometimes it's easier to deal with lognormal data by taking the logarithm of the data and comparing it to the normal distribution:
<<>>=
hist(log(z),freq=FALSE,breaks=100)
curve(dnorm(x,mean=meanlog,sd=sdlog),add=TRUE,lwd=2)
@

\begin{table}
\begin{tabular}{llclp{2in}}
\textbf{Distribution} & \textbf{Type} & \textbf{Range} & \textbf{Skew} & \textbf{Examples} \\
\hline
Binomial & Discrete & $0,N$ & any & Number surviving, number killed \\
Poisson & Discrete & $0,\infty$ & right $\to$ none & Seeds per quadrat, settlers (variance/mean $\approx 1$)\\
Negative binomial & Discrete & $0,\infty$ & right & Seeds per quadrat, settlers (variance/mean $> 1$) \\
Geometric & Discrete & $0,\infty$ & right & Discrete lifetimes \\
Normal & Continuous & $-\infty,\infty$ & none & Mass \\
Gamma & Continuous & $0,\infty$ & right & Survival time, distance to nearest edge \\
Exponential & Continuous & $0,\infty$ & right & Survival time, distance to nearest edge \\
Lognormal & Continuous & $0,\infty$ & right & Size, mass (exponential growth)
\end{tabular}
\caption{Summary of probability distributions}
\end{table}
\afterpage{\clearpage}

\subsection{Mixing and compounding distributions}
\subsubsection{Finite mixture distributions}
The general recipe for generating samples from finite mixtures is to use a uniform deviate to choose which component of the mixture each observation comes from, and then use \code{ifelse} to pick values from one distribution or the other.
To pick 1000 values from a mixture of normal distributions with the parameters shown in Figure~\ref{fig:discrmix} ($p=0.3$, $\mu_1=1$, $\sigma_1=2$, $\mu_2=5$, $\sigma_2=1$):
<<>>=
u1 <- runif(1000)
z <- ifelse(u1<0.3,rnorm(1000,mean=1,sd=2),
   rnorm(1000,mean=5,sd=1))
hist(z,breaks=100,freq=FALSE)
@
The probability density of a finite mixture composed of two distributions $D_1$ and $D_2$ in proportions $p_1$ and $1-p_1$ is $p_1 D_1+(1-p_1) D_2$. We can superimpose the theoretical probability density for the finite mixture above on the histogram:
<<>>=
curve(0.3*dnorm(x,mean=1,sd=2)+0.7*dnorm(x,mean=5,sd=1),
   add=TRUE,lwd=2)
@
The general formula for the probability distribution of a zero-inflated distribution, with an underlying distribution $P(x)$ and a zero-inflation probability of $p_z$, is:
\begin{eqnarray*}
\mbox{Prob}(0) & = & p_z + (1-p_z) P(0) \\
\mbox{Prob}(x>0) & = & (1-p_z) P(x)
\end{eqnarray*}
So, for example, we could define a probability distribution for a zero-inflated negative binomial as follows:
<<>>=
dzinbinom = function(x,mu,size,zprob) {
   ifelse(x==0,
      zprob+(1-zprob)*dnbinom(0,mu=mu,size=size),
      (1-zprob)*dnbinom(x,mu=mu,size=size))
}
@
(the name, \code{dzinbinom}, follows the \R\ convention for a probability distribution function: a \code{d} followed by the abbreviated name of the distribution, in this case \code{zinbinom} for ``\textbf{z}ero-\textbf{i}nflated \textbf{n}egative \textbf{binom}ial'').
\label{zinbinom}
The \code{ifelse} command checks every element of \code{x} to see whether it is zero or not and fills in the appropriate value depending on the answer.
Here's a random deviate generator:
<<>>=
rzinbinom = function(n,mu,size,zprob) {
   ifelse(runif(n)<zprob,
      0,
      rnbinom(n,mu=mu,size=size))
}
@
Each deviate is zero with probability \code{zprob}; otherwise it is drawn from a negative binomial with the specified \code{mu} and \code{size}.

\subsubsection{Compounded distributions}
To generate samples from a compounded distribution, first draw random values of the parameter from the compounding distribution, then feed them, as a vector, to the random-number function for the sampling distribution. For example, drawing 1000 values of $\lambda$ from a Gamma distribution with shape $k=3$ and mean $\mu=10$ (scale $\mu/k$), and using them as the rate of a Poisson, gives a Gamma-Poisson mixture that matches the corresponding negative binomial:
<<>>=
k <- 3
mu <- 10
lambda <- rgamma(1000,shape=k,scale=mu/k)
z <- rpois(1000,lambda)
P1 <- table(factor(z,levels=0:max(z)))/1000
plot(P1)
P2 <- dnbinom(0:max(z),mu=10,size=3)
points(0:max(z),P2)
@
To establish that a lognormal-Poisson and a Gamma-Poisson (negative binomial) are not very different, pick the lognormal-Poisson with approximately the same mean and variance as the negative binomial just shown:
% mean and var of gamma:
<<>>=
mlog <- mean(log(lambda))
sdlog <- sd(log(lambda))
lambda2 <- rlnorm(1000,meanlog=mlog,sdlog=sdlog)
z2 <- rpois(1000,lambda2)
P3 <- table(factor(z2,levels=0:max(z)))/1000
matplot(0:max(z),cbind(P1,P3),pch=1:2)
lines(0:max(z),P2)
@