Negative Binomial as Poisson Mixture
I have recently found it useful to represent the negative binomial (NB) distribution as a continuous mixture distribution. This representation makes it straightforward to see how the NB distribution relates to the Poisson distribution, i.e., how the Poisson assumptions can be relaxed to allow for overdispersion in count data regression. The Bayesian Poisson-gamma mixture model is also a nice setting to illustrate the concept of penalized complexity priors.
Consider a Bayesian setup with random variables $y$, $\nu$, $\zeta$ and constants $\mu,\alpha > 0$ such that $$ \begin{align} y\vert\zeta \sim \textup{Pois}(\zeta), \quad \zeta = \nu\mu, \quad \nu \sim \textup{Gamma}(\alpha, \alpha), \label{eq:gpmix} \end{align} $$ where $y$ is Poisson distributed, conditional on the rate parameter $\zeta$. Further up the hierarchy, $\alpha$ determines both the shape and the rate of the gamma prior for $\nu$, and we have $$ y\vert\zeta \sim \textup{Pois}(\zeta), \quad \zeta \sim \textup{Gamma}(\alpha, \alpha/\mu). $$
The above is known as a Poisson-gamma mixture. The unconditional distribution of the data $\pi(y)$ is obtained from marginalizing the joint distribution $\pi(y,\zeta)=\pi(y\vert\zeta)\pi(\zeta)$ over the gamma mixing distribution $\pi(\zeta)$:
$$ \begin{align} \pi(y) &= \, \int_0^{\infty} \pi(y\vert\zeta)\pi(\zeta)\mathrm{d}\zeta \notag\\ &= \, \int_0^{\infty} \frac{\zeta^y e^{-\zeta}}{y!} \cdot \frac{(\alpha/\mu)^{\alpha}}{\Gamma(\alpha)} \zeta^{\alpha - 1} e^{-\zeta\alpha/\mu} \mathrm{d}\zeta \notag\\ &= \, \frac{(\alpha/\mu)^\alpha}{\Gamma(\alpha)y!} \int_0^{\infty}\zeta^{\alpha + y - 1} e^{-\zeta(1+\alpha/\mu)}\mathrm{d}\zeta \notag\\ &= \, \frac{(\alpha/\mu)^\alpha}{\Gamma(\alpha)y!} \Gamma(\alpha+y) \left(1+\frac{\alpha}{\mu}\right)^{-(\alpha+y)} \label{eq:gammaint}\\ &= \, \frac{\Gamma(\alpha+y)}{\Gamma(\alpha)y!} \left(\frac{\alpha}{\mu}\right)^{\alpha}\left(\frac{\mu}{\mu+\alpha}\right)^{\alpha+y} \notag\\ &= \, \frac{\Gamma(\alpha+y)}{\Gamma(\alpha)y!} \left(\frac{\alpha}{\mu + \alpha}\right)^\alpha \left(\frac{\mu}{\mu + \alpha}\right)^y \label{eq:NB} \end{align} $$
Eq. \eqref{eq:NB} shows that our Poisson-gamma mixture distribution for $y$ is a negative binomial distribution with $\textup{E}(y) = \mu$ and $\textup{Var}(y) = \mu + \phi\mu^2$. Here $\phi=1/\alpha$ is the overdispersion parameter and $\alpha$ is the size parameter.1 We write this as $y \sim \textup{NB}(\phi, \mu)$. From \eqref{eq:NB} we may infer that $\pi(y)$ approaches a Poisson distribution with rate $\mu$ as $\alpha\rightarrow\infty$ (or equivalently $\phi\rightarrow 0$), i.e., in the limiting case without overdispersion.2
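As a quick sanity check, the mixture in \eqref{eq:gpmix} can be simulated and compared with the closed-form NB results; the following sketch (variable names are mine) only uses base R's rgamma(), rpois(), and dnbinom():
# sketch: simulate from the Poisson-gamma mixture and compare with NB(phi, mu)
set.seed(123)
mu <- 3
alpha <- 2  # size; overdispersion phi = 1/alpha = 0.5
nu <- rgamma(1e6, shape = alpha, rate = alpha)
y <- rpois(1e6, lambda = nu * mu)
# moments: E(y) = mu, Var(y) = mu + mu^2 / alpha
c(mean = mean(y), var = var(y))
c(mean = mu, var = mu + mu^2 / alpha)
# the mixture pmf matches dnbinom() in its (size, mu) parameterisation
rbind(
  simulated = table(factor(y, levels = 0:5)) / length(y),
  exact = dnbinom(0:5, size = alpha, mu = mu)
)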
Eq. \eqref{eq:NB} is the default parameterisation used by the R package INLA
(Bayesian modelling with Integrated Nested Laplace Approximation) for modelling data with the NB distribution.
Bayesian Negative Binomial Regression in R-INLA
To turn $y \sim \textup{NB}(\phi, \mu)$ into a (generalized) regression model that explains observed counts by covariates $\boldsymbol{X}$, the mean $\mu_i$ is linked to the linear predictor
$$ \eta_i = \boldsymbol{x}_i^\top\boldsymbol{\beta} $$
via the log link function $\log(\mu_i) = \log(E_i) + \eta_i$, where $\boldsymbol{x}_i^\top$ is the $i$-th row of $\boldsymbol{X}$ and $E_i$ is the exposure (offset) for observation $i$.3
The parameters $\boldsymbol{\beta}$ often receive weakly-informative Gaussian prior distributions in a Bayesian framework. In R-INLA, the size parameter $\alpha$ is modelled as
$$ \begin{align*} \alpha = \exp(\theta) \end{align*} $$
and a prior for $\theta$ can be set by the user. Thus for the overdispersion parameter $\phi$ we have
$$ \theta = \log(\phi^{-1}). $$
While the penalized complexity (PC) prior for $\phi$ is referenced in INLA::inla.doc("pc.gamma"), its definition and handling are… well, kind of unclear. Some insight into the prior’s derivation and properties is given below.
PC prior for $\phi$
The concept of PC priors is to penalize deviations from a simple model (the base model) at a constant rate, see Daniel Simpson et al. (2017a). In Section 4 of the rejoinder to that paper, Daniel Simpson et al. (2017b) present (but do not derive) the PC prior for $\phi$ as
$$ \begin{align} \pi(\phi) &=\, \frac{\lambda}{\phi^2} \frac{\lvert\psi'(\phi^{-1}) - \phi\rvert}{\sqrt{2\log(\phi^{-1}) - 2\psi(\phi^{-1})}}\cdot \exp\left\{-\lambda \sqrt{2\log(\phi^{-1}) - 2\psi(\phi^{-1})} \right\}\label{eq:phipc}, \end{align} $$
with $\psi(\cdot)$ the digamma function and $\psi'(\cdot)$ its first derivative, the trigamma function. $\lambda$ is a scaling parameter that governs the informativeness of the prior (more on this below).
How is this prior derived?
Let $\pi(x\vert\xi = t)$ denote a (complex) model for $x$ with hyperparameter $\xi$. Daniel Simpson et al. (2017a) propose to measure model complexity as a distance $d(t)$ between this model and a base model $\pi(x\vert\xi_0)$, with $\xi_0$ the corresponding baseline value of $\xi$. The distance $d(t)$ is derived from the Kullback-Leibler divergence (KLD)4,
$$ d(t) = \sqrt{2\cdot\text{KLD}(\pi(x\vert\xi = t)\ \vert\vert\ \pi(x\vert\xi_0))}. $$
A PC prior for $\xi$ is obtained by putting an exponential prior on $d(t)$ and transforming back to the space of $\xi$.
Since the NB distribution is a Poisson-gamma mixture, the “complexity” comes from the gamma mixing distribution of $\nu$ in \eqref{eq:gpmix}: we have a model space of $\textup{Gamma}(\phi^{-1},\phi^{-1})$ models. Our base model is the NB model without overdispersion, which is approached as $\phi\rightarrow 0$ (equivalently, $\alpha\rightarrow\infty$): in this limiting case, $\nu$ has a degenerate gamma distribution at $1$ and the Poisson-gamma mixture reduces to a pure Poisson model.
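To illustrate this limiting behaviour numerically, note that the $\textup{Gamma}(\phi^{-1}, \phi^{-1})$ mixing distribution has mean $1$ and variance $\phi$; a small sketch:
# sketch: the Gamma(1/phi, 1/phi) mixing distribution collapses onto 1 as phi -> 0
set.seed(1)
sapply(c(10, 1, 0.1, 0.01), function(phi) {
  nu <- rgamma(1e5, shape = 1 / phi, rate = 1 / phi)
  c(phi = phi, mean = round(mean(nu), 3), var = round(var(nu), 3))
})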
It turns out that
$$ \begin{align} \tilde d(\phi) &=\, \sqrt{2\log(\phi^{-1})-2\psi(\phi^{-1})}, \label{eq:kldgammamodel} \end{align} $$
where $\tilde d(\phi)$ is a renormalized variant of $d(\phi)$.5 The PC prior $\pi(\phi)$ in \eqref{eq:phipc} is obtained by a change of variables after imposing an exponential distribution with rate parameter $\lambda$ on $\tilde d(\phi)$,
$$ \begin{align} \pi(\phi) = \lambda \exp\{-\lambda \tilde d(\phi)\} \left\lvert\frac{\partial \tilde d(\phi)}{\partial\phi}\right\rvert. \label{eq:pcphicof} \end{align} $$
Plugging \eqref{eq:kldgammamodel} into \eqref{eq:pcphicof} yields the expression for the PC prior in \eqref{eq:phipc}.
How is this prior justified?
In general, a PC prior penalizes deviations from the simpler base model, measured in terms of the KLD between the flexible model and this base. Penalization of complexity means that the prior favours simpler models: it places most of its probability mass close to the parameter value(s) yielding the base model and decays (quickly) as we move to points in the parameter space that imply increased model complexity.
An exponential distribution prior on $\tilde d(\cdot)$ yields a constant-rate penalization of model complexity. This means that the relative change in prior density remains constant regardless of how far we deviate from the base model. The exponential form implements a key principle: each additional “unit of model complexity” should receive equal penalization. As noted by Daniel Simpson et al. (2017a), deviating from this constant rate would mean assigning different decay rates to different areas of the parameter space–but this requires domain knowledge about the appropriate distance scale for a particular problem, which we often lack.6
By placing a PC prior on $\phi$ in the NB model, we encode the belief that simpler models (with minimal overdispersion) are preferable unless the data strongly suggests otherwise. The penalization directly corresponds to the deviation of $\phi$ from its base value (where the model is Poisson), aligning the prior with the expected variability of the data.
We can check that the density in \eqref{eq:phipc} matches the definition of the PC prior in INLA
, which is implemented on the natural log scale.
# check definition of pc prior for phi
INLA::inla.pc.dgamma
function (x, lambda = 1, log = FALSE)
{
inv.x <- 1/x
d <- sqrt(2 * (log(inv.x) - psigamma(inv.x)))
ld <- -lambda * d - log(d) + log(lambda) - 2 * log(x) + log(psigamma(inv.x,
deriv = 1) - x)
return(if (log) ld else exp(ld))
}
<bytecode: 0x11f1aff08>
<environment: namespace:INLA>
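As an additional check, here is a direct implementation of \eqref{eq:phipc} (a sketch; pc_prior_phi() is my own helper, not part of INLA), which should agree numerically with INLA::inla.pc.dgamma():
# sketch: PC prior density for phi, implemented directly from the formula above
pc_prior_phi <- function(phi, lambda = 1) {
  inv_phi <- 1 / phi
  d <- sqrt(2 * log(inv_phi) - 2 * digamma(inv_phi))  # tilde d(phi)
  jac <- abs(trigamma(inv_phi) - phi) / (phi^2 * d)   # |d tilde d(phi) / d phi|
  lambda * exp(-lambda * d) * jac
}
# compare at a few values of phi
phi <- c(0.1, 0.5, 1, 2)
cbind(direct = pc_prior_phi(phi), inla = INLA::inla.pc.dgamma(phi, lambda = 1))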
Here’s a plot of the PC prior density in \eqref{eq:phipc} for $\lambda = 1$. Note that the density is highest for small values of $\phi$ and decreases rapidly towards zero for larger values of $\phi$.
# plot pc prior of phi for lambda = 1
ggplot() +
  geom_function(
    fun = ~ INLA::inla.pc.dgamma(., log = F, lambda = 1),
    colour = "red",
    xlim = c(.01, 4)
  ) +
  labs(x = "$\\phi$", y = "pi(phi)")
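To see how $\lambda$ governs the informativeness of the prior, we can overlay the density for several values of $\lambda$; larger values concentrate more mass near the Poisson base model at $\phi = 0$. A sketch, assuming ggplot2, purrr, and tibble are available as elsewhere in this post:
# sketch: pc prior densities of phi for several values of lambda
dens_df <- purrr::map_dfr(c(0.75, 1, 7), function(l) {
  tibble::tibble(
    phi = seq(0.01, 4, length.out = 400),
    density = INLA::inla.pc.dgamma(phi, lambda = l),
    lambda = factor(l)
  )
})
ggplot(dens_df, aes(x = phi, y = density, colour = lambda)) +
  geom_line() +
  labs(x = "phi", y = "pi(phi)")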
Example using simulated overdispersed data
How can the PC prior be used in INLA? We first generate overdispersed data. Since INLA's implementation of the negative binomial likelihood uses the size parameter $\alpha$, we set up our data-generating process (DGP) accordingly.
set.seed(4321)
n <- 100
# simulate NB distributed data
X1 <- rnorm(n, sd = 0.2)
eta <- 1 + X1
mu <- exp(eta)
size <- .1 # so overdispersion = 10
Y <- rnbinom(n, size = size, mu = mu)
dat <- tibble::tibble(Y, X1)
We next transform the data into a long format and create side-by-side histograms of our predictor (X1) and response variable (Y). Tidyverse functions help reshape the data and plot the distributions.
library(dplyr)
library(ggplot2)
library(tidyr)
library(purrr)
# transform to long format
dat_long <- dat %>%
pivot_longer(everything(), names_to = "var")
# plot histograms of simulated data
ggplot(mapping = aes(x = value)) +
geom_histogram(
data = dat_long %>% filter(var == "X1"),
fill = alpha("steelblue", .5),
col = "white",
linewidth = .2
) +
geom_histogram(
data = dat_long %>% filter(var == "Y"),
fill = alpha("steelblue", .5),
col = "white", linewidth = .2
) +
facet_wrap(~ var, scales = "free") +
scale_x_continuous(name = "", expand = c(.02, 0)) +
scale_y_continuous(expand = c(.02, 0))
Next, we fit a negative binomial regression model using INLA. The formula Y ~ 1 + X1 matches the data-generating process, and family = "nbinomial" tells INLA to use a negative binomial likelihood.
For this model, the default prior used for the overdispersion parameter $\phi$ is the PC prior in \eqref{eq:phipc}. This is referred to as pc.mgamma
in the documentation. This implementation accounts for the parametrization of the negative binomial likelihood, which uses a log link for the size $\alpha$,
$$ \begin{align*} \alpha =&\, \exp(\theta) \end{align*} $$
and expects a prior on $\theta$.7 We have
$$ \begin{align*} \frac{1}{\phi} = \exp(\theta)\quad\Leftrightarrow\quad\theta = -\log(\phi), \end{align*} $$
and pc.mgamma is obtained via this change of variables.
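I am not aware of an exported density function for pc.mgamma, but the change of variables is easy to illustrate numerically: the density of $\theta = -\log(\phi)$ is $\pi(\phi)\,\lvert\mathrm{d}\phi/\mathrm{d}\theta\rvert = \pi(e^{-\theta})\,e^{-\theta}$ and should still integrate to one. A sketch (pc_prior_theta() is my own helper):
# sketch: pc prior of phi transformed to INLA's internal scale theta = -log(phi)
pc_prior_theta <- function(theta, lambda = 1) {
  phi <- exp(-theta)
  INLA::inla.pc.dgamma(phi, lambda = lambda) * phi  # Jacobian |d phi / d theta| = phi
}
# the transformed density integrates to (approximately) one
integrate(pc_prior_theta, lower = -10, upper = 25, lambda = 1)$value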
We first estimate the NB model with the default settings and specify that the output should include the model selection criteria DIC and WAIC.8
library(INLA)
# estimate NB regression model using INLA
r <- inla(
formula = Y ~ 1 + X1,
data = dat,
family = "nbinomial",
control.compute = list(waic = T, dic = T)
)
summary(r)
Time used:
Pre = 1.56, Running = 0.299, Post = 0.0128, Total = 1.87
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) 0.950 0.232 0.496 0.950 1.409 0.950 0
X1 0.311 1.363 -2.373 0.312 2.986 0.312 0
Model hyperparameters:
mean sd 0.025quant
size for the nbinomial observations (1/overdispersion) 0.212 0.038 0.146
0.5quant 0.975quant
size for the nbinomial observations (1/overdispersion) 0.209 0.297
mode
size for the nbinomial observations (1/overdispersion) 0.203
Deviance Information Criterion (DIC) ...............: 326.34
Deviance Information Criterion (DIC, saturated) ....: 96.75
Effective number of parameters .....................: 2.73
Watanabe-Akaike information criterion (WAIC) ...: 327.28
Effective number of parameters .................: 3.20
Marginal log-Likelihood: -181.04
is computed
Posterior summaries for the linear predictor and the fitted values are computed
(Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')
How should we choose $\lambda$ in the PC prior? The default is $\lambda = 7$. We may quantify the implications of this choice by the probability that $\phi$ exceeds some threshold $q$, $P(\phi > q)=r_q$. For example:
# tail prob.: P(phi > .2 | lambda = 7)
1 - INLA::inla.pc.pgamma(q = .2, lambda = 7)
[1] 0.04109788
Thus, using $\lambda = 7$ we place substantial weight near the base model. When there are many observations, the data will likely outweigh the tendency of the prior to favour models with less overdispersion, but in my example choosing a smaller $\lambda$ may improve the model fit.
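Conversely, we can solve for the $\lambda$ that yields a desired tail probability, say $P(\phi > 10) = 0.05$; a sketch using uniroot() (lambda_for_tail() is my own helper):
# sketch: find lambda such that P(phi > q) = r under the pc prior
lambda_for_tail <- function(q, r) {
  uniroot(
    function(l) 1 - INLA::inla.pc.pgamma(q, lambda = l) - r,
    interval = c(1e-3, 100)
  )$root
}
lambda_for_tail(q = 10, r = 0.05)  # roughly 0.75, the value used below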
# tail prob.: P(phi > 10) ~ .05
1 - INLA::inla.pc.pgamma(q = 10, lambda = .75)
[1] 0.04813756
I’ve chosen $\lambda=0.75$ here. This encodes my belief that there is quite some overdispersion. We may specify $\lambda$ in INLA::inla() by adding a corresponding entry hyper to the control.family list argument.
nbin_pc <- inla(
  formula = Y ~ 1 + X1,
  data = dat,
  family = "nbinomial",
  control.family = list(
    hyper = list(
      theta = list(
        prior = "pc.mgamma",
        param = .75 # <- this is lambda
      )
    )
  ),
  control.compute = list(waic = T, dic = T)
)
summary(nbin_pc)
summary(nbin_pc)
Time used:
Pre = 1.4, Running = 0.256, Post = 0.00691, Total = 1.67
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) 0.976 0.282 0.426 0.975 1.536 0.975 0
X1 0.262 1.654 -3.004 0.265 3.509 0.265 0
Model hyperparameters:
mean sd 0.025quant
size for the nbinomial observations (1/overdispersion) 0.142 0.031 0.09
0.5quant 0.975quant
size for the nbinomial observations (1/overdispersion) 0.139 0.21
mode
size for the nbinomial observations (1/overdispersion) 0.134
Deviance Information Criterion (DIC) ...............: 323.13
Deviance Information Criterion (DIC, saturated) ....: 73.80
Effective number of parameters .....................: 2.90
Watanabe-Akaike information criterion (WAIC) ...: 323.29
Effective number of parameters .................: 2.69
Marginal log-Likelihood: -164.74
is computed
Posterior summaries for the linear predictor and the fitted values are computed
(Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')
Both models fit the data quite well, with posterior distributions that cover the true parameter values. WAIC and DIC suggest a somewhat better fit for the nbin_pc model.
We next generate 10,000 random samples from the marginal posterior distributions of the intercept, the coefficient on $X_1$, and $1/\phi$ in the nbin_pc model, and visualize them with density estimates to get some insight into the uncertainty and the shapes of the distributions.
list(
  # marginals.fixed is ordered as in the model formula: (Intercept) first, then X1
  "Intercept" = nbin_pc$marginals.fixed[[1]],
  "X1" = nbin_pc$marginals.fixed[[2]],
  "1/phi (size)" = nbin_pc$marginals.hyperpar[[1]]
) %>%
  map_dfc(~ inla.rmarginal(n = 1e4, marginal = .x)) %>%
  pivot_longer(everything(), names_to = "parameter") %>%
  ggplot(aes(x = value)) +
  geom_density(
    fill = alpha("steelblue", .5),
    color = "steelblue",
    show.legend = F
  ) +
  scale_x_continuous(expand = expansion(.2)) +
  scale_y_continuous(expand = c(.02, 0)) +
  ylab("density") +
  xlab("parameter") +
  facet_wrap(~ parameter, scales = "free")
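Since INLA reports the size $1/\phi$, the posterior of the overdispersion $\phi$ itself is easily obtained by transforming the sampled values (a sketch):
# sketch: posterior summary of the overdispersion phi = 1 / size
size_samples <- inla.rmarginal(n = 1e4, marginal = nbin_pc$marginals.hyperpar[[1]])
quantile(1 / size_samples, probs = c(.025, .5, .975))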
That’s it for now :).9
References
Simpson, Dan. 2022. “Priors Part 4: Specifying Priors That Appropriately Penalise Complexity,” September. https://dansblog.netlify.app/2022-08-29-priors4/2022-08-29-priors4.html.
Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G Martins, and Sigrunn H Sørbye. 2017a. “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.” Statistical Science 32 (1): 1–28.
———. 2017b. “You Just Keep on Pushing My Love over the Borderline: A Rejoinder.” Statistical Science 32 (1): 44–46.
1. The step leading to \eqref{eq:gammaint} follows from the gamma integral $\int_0^{\infty}\zeta^{s-1}e^{-c\zeta}\,\mathrm{d}\zeta = \Gamma(s)\,c^{-s}$ for $s, c > 0$, here with $s = \alpha + y$ and $c = 1 + \alpha/\mu$.
2. Using Stirling’s approximation for the gamma function at large arguments, $\Gamma(x)\approx\sqrt{2\pi}\, x^{x-1/2} e^{-x}$, one may show that $\pi(y)\to\frac{\mu^y}{y!} e^{-\mu}$ as $\alpha\to\infty$.
3. In both Poisson and negative binomial regression, the exposure $E_i$ represents the time period or “population” size over which counts are observed. For example, in epidemiology, exposure might be person-years at risk.
4. The KLD measures the difference between two probability distributions by quantifying the amount of information lost when approximating one distribution by the other.
5. This renormalization of $d(\phi)$ is necessary because we compute the KLD relative to a singular base model, which causes the distance to diverge.
6. Daniel Simpson et al. (2017a) list four principles motivating PC priors; here, I focus on understanding their mathematical form.
7. Log transformations are common as they lead to more stable computations in optimization and inference algorithms.
8. Note that the exposure is $E_i = 1$ here, which is also the default in inla().
9. Update Nov. 2022: Dan Simpson (2022) himself provides a great, not too technical motivation for PC priors and more examples.