Matt Graham <matt-graham.github.io>
Given some probability distribution defined on a real vector space $\mathbb{R}^D$ by the density function
\begin{equation} \mathbb{p}\left[\mathbf{x} = \boldsymbol{x}\right] \propto \exp \left\lbrace - \phi(\boldsymbol{x}) \right\rbrace, \end{equation}generate a set of samples $\left\lbrace \boldsymbol{x}^{(n)}\right\rbrace_{n=1}^N$ from a Markov chain which has the distribution defined by $\mathbb{p}\left[\mathbf{x} = \boldsymbol{x}\right]$ as its unique invariant measure.
The MCMC samples can then be used to compute Monte Carlo approximations to expectations
\begin{equation} \mathbb{E}\left[\,f(\mathbf{x})\,\right] \approx \frac{1}{M} \sum_{n=1}^N \left\lbrace f\left( \boldsymbol{x}^{(n)} \right) \right\rbrace. \end{equation}Support of distribution is full vector space:
\begin{equation} \mathbb{p}\left[\mathbf{x} = \boldsymbol{x}\right] > 0 \quad\forall \boldsymbol{x} \in \mathbb{R}^N. \end{equation}If support is a bounded subset, can sometimes transform to an equivalent unconstrained space using a variable transform e.g. $\log(u - c)$ if $u > c$.
Define a proposal density $q\left[\boldsymbol{x}' \,|\, \boldsymbol{x}\right]$ we can tractably sample from, generate a sample $\boldsymbol{x}'$ from it given the current state $\boldsymbol{x}$ and then accept the proposal with probability
\begin{equation} a \left[ \boldsymbol{x}' \,|\, \boldsymbol{x} \right] = \min \left\lbrace 1, \frac{q\left[\boldsymbol{x} \,|\, \boldsymbol{x}'\right] \mathbb{p}\left[\mathbf{x} = \boldsymbol{x}'\right]} {q\left[\boldsymbol{x}' \,|\, \boldsymbol{x}\right] \mathbb{p}\left[\mathbf{x} = \boldsymbol{x}\right]} \left|\frac{\partial \boldsymbol{x}'}{\partial \boldsymbol{x}}\right| \right\rbrace. \end{equation}The Jacobian term accounts for a possible change of measure with which the densities are defined with respect to. See Green (1995) or Lan et al. (2012).
If proposal density is symmetric
\begin{equation} q\left[\boldsymbol{x}' \,|\, \boldsymbol{x}\right] = q\left[\boldsymbol{x} \,|\, \boldsymbol{x}'\right] \quad\forall \boldsymbol{x},\, \boldsymbol{x}' \in \mathbb{R}^N, \end{equation}then the acceptance probability reduces to
$$ a \left[ \boldsymbol{x}' \,|\, \boldsymbol{x} \right] = \min \left\lbrace 1, \exp\left[ \phi(\boldsymbol{x}) - \phi(\boldsymbol{x}') \right] \left|\frac{\partial \boldsymbol{x}'}{\partial \boldsymbol{x}}\right| \right\rbrace. $$Key problem is finding proposal density that allows proposes 'large' moves with high probability of acceptance.
Augment state space with a vector 'momentum' variable $\mathbf{p} \in \mathbb{R}^D$ and a signed 'time direction' variable $\mathsf{d} \in \left\lbrace -1,\,+1 \right\rbrace$
If $\mathbf{S}$ is a $2D\times 2D$ constant non-singular skew-symmetric matrix then we can define a Hamiltonian dynamic on the joint system $\boldsymbol{z} = \left[ \boldsymbol{x};\, \boldsymbol{p} \right]$ by
\begin{equation} \frac{\mathrm{d}\boldsymbol{z}}{\mathrm{d} t} = d \, \mathbf{S} \frac{\partial H}{\partial\boldsymbol{z}} ~~\Leftrightarrow~~ \left[ \begin{array}{c} \frac{\mathrm{d}\boldsymbol{x}}{\mathrm{d}t} \\ \frac{\mathrm{d}\boldsymbol{p}}{\mathrm{d}t} \end{array} \right] = d \, \mathbf{S} \left[ \begin{array}{c} \frac{\partial H}{\partial\boldsymbol{x}} \\ \frac{\partial H}{\partial\boldsymbol{p}} \end{array} \right]. \end{equation}A flow map $\Psi_{T,d}\left\lbrace \boldsymbol{z}_0 \right\rbrace = \boldsymbol{z}(T)$ is then defined by the solution $\boldsymbol{z}(T)$ to the initial value problem
\begin{equation} \frac{\mathrm{d}\boldsymbol{z}}{\mathrm{d}t} = \frac{\partial H}{\partial\boldsymbol{z}}, \quad \boldsymbol{z}(0) = \boldsymbol{z}_0, \quad t \in [0, T]. \end{equation}Typically $\mathbf{S} = \left[ \begin{array}{cc} \mathbf{0} & \mathbf{I} \\ -\mathbf{I} & \mathbf{0} \end{array} \right]$ in which case the dynamic is canonical.
Exactly conserves the Hamiltonian $H(\boldsymbol{z})$
\begin{equation} \frac{\mathrm{d}H}{\mathrm{d}t} = \frac{\partial H}{\partial\boldsymbol{z}}^{\mathrm{T}} \frac{\mathrm{d}\boldsymbol{z}}{\mathrm{d}t} = d \frac{\partial H}{\partial\boldsymbol{z}}^{\mathrm{T}} \mathbf{S} \frac{\partial H}{\partial\boldsymbol{z}} = 0 \end{equation}Reversible under negation of $\mathsf{d}$
\begin{equation} \text{If } \boldsymbol{z}' = \Psi_{T,+1} \left\lbrace \boldsymbol{z} \right\rbrace \text{ then } \boldsymbol{z} = \Psi_{T,-1} \left\lbrace \boldsymbol{z}' \right\rbrace. \end{equation}Preserves volume as flow is divergence-free
\begin{equation} \left( \frac{\partial }{\partial\boldsymbol{z}} \right)^{\mathrm{T}} \frac{\mathrm{d}\boldsymbol{z}}{\mathrm{d}t} = \mathrm{Tr} \left[ \mathbf{S} \frac{\partial^2 H}{\partial \boldsymbol{z}\,\partial\boldsymbol{z}^{\mathrm{T}}} \right] = 0 \Rightarrow \frac{\partial \Psi_{T,d}}{\partial\boldsymbol{z}} = \mathbf{I}. \end{equation}The dynamic also has the further property of being symplectic map with respect to the structure matrix $\mathbf{S}$
\begin{equation} \frac{\partial \Psi_{T,d}}{\partial\boldsymbol{z}}^{\mathrm{T}} \mathbf{S}^{-1} \frac{\partial \Psi_{T,d}}{\partial\boldsymbol{z}} = \mathbf{S}^{-1}, \end{equation}see for example Leimkuhler and Reich (2005) for proof and more details.
Symplecticness implies volume preservation but is a more stringent requirment for $D > 1$.
If we therefore define a proposal density
\begin{equation} q\left[\boldsymbol{z}', d' \,|\, \boldsymbol{z}, d\right] = \delta \left[ \Psi_{T,d}\left\lbrace \boldsymbol{z} \right\rbrace - \boldsymbol{z}' \right] \delta \left[ d - (-d') \right], \end{equation}i.e. propose new state by deterministically running Hamiltonian dynamics forward $T$ units of time then reverse time flow, then our acceptance probability will be unity
\begin{equation} \min \left\lbrace 1, \exp\left[ H(\boldsymbol{z}') - H(\boldsymbol{z}) \right] \left|\frac{\partial \boldsymbol{z}'}{\partial\boldsymbol{z}}\right| \right\rbrace = 1. \end{equation}We can then deterministically flip $\mathsf{d}$ (so on the next proposal we won't go back to our previous point) as
$$ \mathbb{P}\left[\mathsf{d}=\pm1\,|\,\mathbf{x}=\boldsymbol{x},\,\mathbf{p}=\boldsymbol{x}\right] = \frac{1}{2}, $$and so this move also leaves the joint density invariant.
This composition of transitions will not be ergodic however in joint state space as we remain confined to the same constant Hamiltonian manifold.
There is a class of integrators which also preserve the symplectic map property of the exact dynamic.
Symplectic integrators have a further useful property:
The standard (and original) implementation of HMC augments the system with variables which are independent of the original state and have a Gaussian conditional / marginal
The derivative $\frac{\partial \tau}{\partial\boldsymbol{p}} = \mathbf{M}^{-1}\boldsymbol{p}$ is now just a linear transform of the $\boldsymbol{p}$ variables which can be considered in analogy to Newtonian mechanics momentum variables with $\mathbf{M}$ a mass matrix
In this case as the distribution on $\mathbf{p}$ is symmetric there is no need to add a further binary direction variable $\mathsf{d}$ as reversibility can be achieved by negating the momentum variables
\begin{equation} \tau(\boldsymbol{p}) = \tau(-\boldsymbol{p}) = \frac{1}{2}\boldsymbol{p}^{\mathrm{T}}\mathbf{M}^{-1}\boldsymbol{p}. \end{equation}Further as $\mathbb{p}\left[\mathbf{p}=\boldsymbol{p}\,|\,\mathbf{x}=\boldsymbol{x}\right]$ is Gaussian we can easily resample the momentum variables between dynamic proposal updates to alter the energy of the system and ensure ergodicity.
For separable Hamiltonians (no terms coupling $\boldsymbol{x}$ and $\boldsymbol{p}$) there is a particularly efficient symplectic integration scheme called the leapgfrog (or Störmer-Verlet) method composed of two step types
Individually each of these steps is volume preserving
\begin{align} \left| \frac{\partial \Phi^A_{\delta t}}{\partial\boldsymbol{z}} \right| &= \left| \begin{array}{cc} \mathbf{I} & \mathbf{0} \\ \dots & \mathbf{I} \end{array} \right| = 1, \\[2mm] \left| \frac{\partial \Phi^B_{\delta t}}{\partial\boldsymbol{z}} \right| &= \left| \begin{array}{cc} \mathbf{I} & \dots \\ \mathbf{0} & \mathbf{I} \end{array} \right| = 1. \end{align}Further any composition of the steps is also volume preserving.
In particular a symmetric composition of the form
\begin{equation} \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t} \end{equation}is also time reversible and symplectic.
Overall this gives a single step of leapfrog dynamics as
\begin{align} \boldsymbol{p}^{\left(n + \frac{1}{2}\right)} &= \boldsymbol{p}^{(n)} - \frac{1}{2} \delta t \left. \frac{\partial \phi}{\partial\boldsymbol{x}} \right|_{\,\boldsymbol{x}^{(n)}}\\[2mm] \boldsymbol{x}^{(n+1)} &= \boldsymbol{x}^{(n)} + \delta t \, \mathbf{M}^{-1} \boldsymbol{p}^{\left(n + \frac{1}{2}\right)} \\[2mm] \boldsymbol{p}^{(n+1)} &= \boldsymbol{p}^{\left(n + \frac{1}{2}\right)} - \frac{1}{2} \delta t \left. \frac{\partial \phi}{\partial\boldsymbol{x}} \right|_{\,\boldsymbol{x}^{(n+1)}}\\ \end{align}In practice tend to combine half steps after initial one
\begin{equation} \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{B}_{\delta t} \circ \left\lbrace \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t} \right\rbrace \circ \Phi^{B}_{\delta t} \circ \\\dots \left\lbrace \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t} \right\rbrace \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t} \end{equation}\begin{equation} = \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\delta t} \circ \Phi^{B}_{\delta t} \dots \Phi^{A}_{\delta t} \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t}. \end{equation}
Metropolis-Hastings updates with Hamiltonian dynamics proposals alone will not generally ensure ergodicity - constrained to near constant Hamiltonian surface.
Overcome by alternating with a different Markov transition operator which leaves joint distribution invariant.
In particular we can use any transition which leaves the conditional on the momenta given the positions invariant (c.f. Gibbs sampling). In case of standard HMC, momenta are independent of positions therefore resample independently from Gaussian distribution
\begin{equation} \boldsymbol{p} \sim \mathcal{N}\left(\cdot; \boldsymbol{0}, \mathbf{M}\right). \end{equation}More general update with partial momentum refreshal from Horowitz (1991) parameterised by $\theta \in \left[0,\,\frac{\pi}{2}\right]$
\begin{equation} \boldsymbol{p}' = \cos(\theta)\, \boldsymbol{n} + \sin(\theta)\, \boldsymbol{p} \quad \text{with } \boldsymbol{n} \sim \mathcal{N}\left(\cdot; \boldsymbol{0}, \mathbf{M}\right). \end{equation}