When working with distance measures between distributions, singularities pose a significant challenge. They arise when one of the distributions is degenerate, concentrating all its probability mass on a single point. In a previous post, we discussed the comparison of a Gamma distribution (representing a complex model) with a singular Gamma distribution (representing a base model) in the context of constructing a penalized complexity (PC) prior for the overdispersion parameter $\phi$ in a Bayesian negative binomial regression. In that post, I stated that the distance measure of interest for constructing the PC prior is
$$ \begin{align} \tilde d(\phi) &=\, \sqrt{2\log(\phi^{-1})-2\psi(\phi^{-1})}. \label{eq:kldgammamodel} \end{align} $$
I hinted that $\tilde d(\phi)$ is based on a normalization that addresses the divergence of the Kullback-Leibler divergence (KLD) involving a degenerate distribution.
In this post, I’ll discuss why the degenerate base model makes a direct application of the KLD problematic and show how to derive the renormalized distance upon which Simpson et al. (2017b) construct their PC prior for $\phi$.
The Problem
We are tasked with comparing a Gamma distribution $P \sim \text{Gamma}(\phi^{-1}, \phi^{-1})$ to a “baseline” Gamma distribution $Q \sim \text{Gamma}(\phi_0^{-1}, \phi_0^{-1})$. The challenge arises when $\phi_0 \to 0$ (i.e., $\phi_0^{-1} \to \infty$), which makes $Q$ degenerate: its mean is $1$ and its variance $\phi_0$, so all its probability mass concentrates at $1$. To see why this is a problem, write out the KLD between the two distributions,
$$ \begin{align} \begin{split} \text{KLD}(P \Vert Q) =&\, (\phi^{-1} - \phi_0^{-1})\psi(\phi^{-1}) - \log\Gamma(\phi^{-1}) + \log\Gamma(\phi_0^{-1})\\ +&\, \phi_0^{-1} \log\left(\frac{\phi^{-1}}{\phi_0^{-1}}\right) + \phi^{-1}\frac{\phi_0^{-1} - \phi^{-1}}{\phi^{-1}}, \end{split}\label{eq:kldgammabase} \end{align} $$
where $\psi(\cdot)$ is the digamma function and $\Gamma(\cdot)$ is the gamma function. As $\phi_0 \to 0^+$ (for some fixed $\phi$), the term $\log\Gamma(\phi_0^{-1})$ diverges (by Stirling’s formula, $\log\Gamma(x)$ grows like $x\log x$ as $x \to \infty$), and, as the expansion below shows, the divergent terms do not cancel: the KLD itself diverges for any fixed $\phi$.
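To make this concrete, here is a minimal numerical sketch of the divergence, implementing the closed-form KLD in \eqref{eq:kldgammabase} with SciPy (the helper name `kld_gamma` and the example values are mine, not from the papers):

```python
import numpy as np
from scipy.special import digamma, gammaln

def kld_gamma(phi, phi0):
    """KLD between Gamma(1/phi, 1/phi) and Gamma(1/phi0, 1/phi0),
    following the closed-form expression above."""
    a, a0 = 1.0 / phi, 1.0 / phi0  # shape = rate for both distributions
    return ((a - a0) * digamma(a) - gammaln(a) + gammaln(a0)
            + a0 * np.log(a / a0) + a0 - a)

# The KLD blows up as the base model degenerates (phi0 -> 0):
for phi0 in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(f"phi0 = {phi0:.0e}: KLD = {kld_gamma(0.5, phi0):10.2f}")
```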
$\epsilon$-Distance
Remember that the measure we are interested in is of the form
$$ \begin{align} d(a) = \sqrt{2\cdot \text{KLD}(P(a) \Vert Q(b))}, \end{align} $$ where $P(a)$ is the flexible model and $Q(b)$ the base model.
To obtain a meaningful distance in our example, we must renormalize the expression, as suggested in Appendix A of Simpson et al. (2017a). To this end, we replace $\phi_0$ with a small positive value $\epsilon > 0$. This gives us an “$\epsilon$-distance”:
$$ \begin{align} d_\epsilon(\phi) := \sqrt{2\cdot \text{KLD}(\text{Gamma}(\phi^{-1}, \phi^{-1}) \Vert \text{Gamma}(\epsilon^{-1}, \epsilon^{-1}))}. \end{align} $$
Substituting $\phi_0 = \epsilon$ in \eqref{eq:kldgammabase}, the KLD becomes
$$ \begin{align} \text{KLD}(P_{\phi} \parallel P_{\epsilon}) &=\, (\phi^{-1} - \epsilon^{-1}) \psi(\phi^{-1}) - \textcolor{#E69F00}{\log \Gamma(\phi^{-1}) + \log \Gamma(\epsilon^{-1})} \\ &\quad + \epsilon^{-1} \log\left( \frac{\phi^{-1}}{\epsilon^{-1}} \right) + \epsilon^{-1} - \phi^{-1}. \end{align} $$
Note that $\text{KLD}(P_{\phi} \parallel P_{\epsilon})$ is finite for any $\epsilon > 0$. However, as $\epsilon \to 0^+$ the expression diverges.
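Numerically, this is easy to see. Continuing the sketch from above (the helper `d_eps` is again my own naming), the $\epsilon$-distance is finite for every fixed $\epsilon > 0$ but grows roughly tenfold each time $\epsilon$ shrinks by a factor of $100$, hinting at the $\mathcal{O}(\epsilon^{-1/2})$ behaviour derived below:

```python
def d_eps(phi, eps):
    """The epsilon-distance d_eps(phi), reusing kld_gamma from above."""
    return np.sqrt(2 * kld_gamma(phi, eps))

# Finite for every eps > 0, but unbounded as eps -> 0:
for eps in [1e-2, 1e-4, 1e-6]:
    print(f"eps = {eps:.0e}: d_eps = {d_eps(0.5, eps):8.2f}")
```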
Let’s inspect the behaviour of the individual terms as $\epsilon \to 0^+$ and thus $\epsilon^{-1}\to\infty$:
$$ \begin{align} \text{KLD}(P_{\phi} \parallel P_{\epsilon}) \sim&\, \phi^{-1} \psi(\phi^{-1}) - \epsilon^{-1} \psi(\phi^{-1}) \notag \\ &\,\textcolor{#E69F00}{-\,\phi^{-1} \log \phi^{-1} + \phi^{-1} + \epsilon^{-1} \log \epsilon^{-1} - \epsilon^{-1} } \notag \\ &+\, \epsilon^{-1} \log \phi^{-1} - \epsilon^{-1} \log \epsilon^{-1} \notag \\ &+\, \epsilon^{-1} - \phi^{-1}.\notag\\ \notag\\ =&\, \phi^{-1} \psi(\phi^{-1}) - \epsilon^{-1} \psi(\phi^{-1})\notag\\ &-\phi^{-1} \log \phi^{-1} + \epsilon^{-1} \log \phi^{-1}\notag\\ \notag\\ =&\,\epsilon^{-1}\left[ \log \phi^{-1} - \psi(\phi^{-1}) \right] + \textup{finite terms} \label{eq:herestheorder} \end{align} $$
The orange terms come from Stirling’s approximation of the log-gamma function, $\log\Gamma(x) \approx x\log x - x$ (applying it to $\log\Gamma(\phi^{-1})$ only alters the finite part, which drops out after renormalization anyway). The second block follows from cancellation and rearranging. In the last block we factor out $\epsilon^{-1}$ and gather the finite terms (those depending only on $\phi$).
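We can sanity-check the leading order in \eqref{eq:herestheorder}: multiplying the exact KLD by $\epsilon$ should isolate the coefficient $\log \phi^{-1} - \psi(\phi^{-1})$ as $\epsilon \to 0^+$ (again reusing the hypothetical `kld_gamma` from above):

```python
phi = 0.5
coeff = np.log(1 / phi) - digamma(1 / phi)  # log(phi^-1) - psi(phi^-1)

# eps * KLD approaches the coefficient of the divergent term:
for eps in [1e-2, 1e-4, 1e-6]:
    print(f"eps = {eps:.0e}: eps * KLD = {eps * kld_gamma(phi, eps):.6f} "
          f"(coefficient: {coeff:.6f})")
```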
Renormalization
From \eqref{eq:herestheorder} we see that the KLD is $\mathcal{O}(\epsilon^{-1})$ and thus for the $\epsilon$-distance $d_{\epsilon}(\phi)$ we find
$$ \begin{align*} d_{\epsilon}(\phi) =&\, \sqrt{2\cdot\text{KLD}(P_{\phi} \parallel P_{\epsilon})}\\ \approx&\, \sqrt{2\cdot\epsilon^{-1} [\log \phi^{-1} - \psi(\phi^{-1})] + \textup{finite terms}}\\ =&\, \mathcal{O}(\epsilon^{-1/2}). \end{align*} $$
What now? Well, in such cases Simpson et al. (2017a) suggest renormalizing the $\epsilon$-distance $d_{\epsilon}(\phi)$:
$$ \begin{align*} \tilde{d}_\epsilon(\phi) := \epsilon^{p/2} d_\epsilon(\phi), \end{align*} $$
where $p$ is chosen such that $\epsilon^{p/2}$ cancels the divergent part of $d_{\epsilon}(\phi)$. Since the divergent part of $d_\epsilon(\phi)$ scales as $\epsilon^{-1/2}$ in our case, we set $p = 1$. Now we take the limit $\epsilon \to 0^+$,
$$ \begin{align*} \tilde{d}(\phi) :=&\, \lim_{\ \epsilon \to 0^+} \tilde{d}_\epsilon(\phi)\\ =&\,\sqrt{2\cdot[\log \phi^{-1} - \psi(\phi^{-1})]}, \end{align*} $$
where the finite terms (those depending on $\phi$ only) are multiplied by $\epsilon$ once pulled inside the square root and hence vanish in the limit. This is how we obtain the desired renormalized distance $\tilde{d}(\phi)$ 😊.
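As a final check, $\sqrt{\epsilon}\, d_\epsilon(\phi)$ indeed converges to the closed-form $\tilde{d}(\phi)$ (a sketch with the hypothetical helpers from above):

```python
def d_tilde(phi):
    """The renormalized distance obtained in the limit above."""
    return np.sqrt(2 * (np.log(1 / phi) - digamma(1 / phi)))

phi = 0.5
for eps in [1e-2, 1e-4, 1e-6]:
    print(f"eps = {eps:.0e}: sqrt(eps) * d_eps = {np.sqrt(eps) * d_eps(phi, eps):.4f}")
print(f"limit: d_tilde = {d_tilde(phi):.4f}")
```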
Update Nov. 2022
Simpson (2022) provides a useful explanation of how to interpret the renormalized distance $\tilde{d}(\phi)$:
For large arguments, the digamma function $\psi(\phi^{-1})$ has the approximation $$ \psi(\phi^{-1}) = \log(\phi^{-1}) - \frac{\phi}{2} + \mathcal{O}(\phi^2). $$
If $\phi$ is small ($\phi^{-1}$ large), we can use this approximation to write:
$$ \psi(\phi^{-1}) \approx \log \phi^{-1} - \frac{1}{2} \phi. $$
Substituting back into $\tilde{d}(\phi)$ yields
$$ \begin{align*} \tilde{d}(\phi) \approx&\, \sqrt{2\cdot \left[ \log \phi^{-1} - \left( \log \phi^{-1} - \frac{1}{2} \phi \right) \right] }\\ =&\, \sqrt{ \phi }. \end{align*} $$
Since the variance of the r.v. $\nu\sim\textup{Gamma}(\phi^{-1},\phi^{-1})$ is $\phi$, $\tilde{d}(\phi)$ is approximately the standard deviation of $\nu$ for small $\phi$.
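A quick numerical comparison (using the hypothetical `d_tilde` helper from above) shows how quickly the approximation kicks in:

```python
# d_tilde(phi) approaches sqrt(phi), the standard deviation of
# nu ~ Gamma(1/phi, 1/phi), as phi -> 0:
for phi in [1.0, 0.1, 0.01]:
    print(f"phi = {phi:5.2f}: d_tilde = {d_tilde(phi):.4f}, "
          f"sqrt(phi) = {np.sqrt(phi):.4f}")
```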
References
Simpson, D. (2022). Priors part 4: Specifying priors that appropriately penalise complexity. https://dansblog.netlify.app/2022-08-29-priors4/2022-08-29-priors4.html
Simpson, D., Rue, H., Riebler, A., Martins, T. G., & Sørbye, S. H. (2017a). Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science, 32(1), 1–28.
Simpson, D., Rue, H., Riebler, A., Martins, T. G., & Sørbye, S. H. (2017b). You just keep on pushing my love over the borderline: A rejoinder. Statistical Science, 32(1), 44–46.