Hamiltonian Monte Carlo

A brief(ish) introduction

Matt Graham <matt-graham.github.io>

Preliminaries

Task

Given some probability distribution defined on a real vector space $\mathbb{R}^D$ by the density function

\begin{equation} \mathbb{p}\left[\mathbf{x} = \boldsymbol{x}\right] \propto \exp \left\lbrace - \phi(\boldsymbol{x}) \right\rbrace, \end{equation}

generate a set of samples $\left\lbrace \boldsymbol{x}^{(n)}\right\rbrace_{n=1}^N$ from a Markov chain which has the distribution defined by $\mathbb{p}\left[\mathbf{x} = \boldsymbol{x}\right]$ as its unique invariant measure.

The MCMC samples can then be used to compute Monte Carlo approximations to expectations

\begin{equation} \mathbb{E}\left[\,f(\mathbf{x})\,\right] \approx \frac{1}{N} \sum_{n=1}^N f\left( \boldsymbol{x}^{(n)} \right). \end{equation}
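As a concrete illustration, a minimal Python sketch of this estimator, assuming the samples have been stacked into a hypothetical `(N, D)` NumPy array `samples`:

```python
import numpy as np

def monte_carlo_estimate(f, samples):
    """Approximate E[f(x)] by the empirical average of f over MCMC samples.

    samples: array of shape (N, D), one sample x^(n) per row.
    f: function mapping a state vector to a scalar or array.
    """
    return np.mean([f(x) for x in samples], axis=0)

# Hypothetical usage: estimate the mean of the sampled distribution.
# mean_estimate = monte_carlo_estimate(lambda x: x, samples)
```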

Assumptions

  • Support of distribution is full vector space:

    \begin{equation} \mathbb{p}\left[\mathbf{x} = \boldsymbol{x}\right] > 0 \quad\forall \boldsymbol{x} \in \mathbb{R}^D. \end{equation}

    If the support is a bounded subset, we can sometimes map to an equivalent unconstrained space using a variable transform, e.g. $\log(u - c)$ if $u > c$ (see the sketch after this list).

  • Density function is everywhere differentiable with respect to $\boldsymbol{x}$ and the gradients can be tractably computed.
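A minimal sketch of the $\log(u - c)$ reparameterisation mentioned above, assuming a hypothetical negative log density `phi_u` defined for $u > c$: working in $v = \log(u - c)$, the transformed negative log density picks up a $-v$ term from the log Jacobian of $u = \exp(v) + c$.

```python
import numpy as np

def phi_unconstrained(v, phi_u, c):
    """Negative log density of v = log(u - c) for a target defined on u > c.

    With u = exp(v) + c the Jacobian is du/dv = exp(v), so (up to a constant)
    -log p_v(v) = -log p_u(u) - v.
    phi_u is a hypothetical negative log density of the original variable u.
    """
    u = np.exp(v) + c
    return phi_u(u) - v
```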

Metropolis-Hastings - quick recap

Define a proposal density $q\left[\boldsymbol{x}' \,|\, \boldsymbol{x}\right]$ that we can tractably sample from, generate a proposed state $\boldsymbol{x}'$ from it given the current state $\boldsymbol{x}$, and then accept the proposal with probability

\begin{equation} a \left[ \boldsymbol{x}' \,|\, \boldsymbol{x} \right] = \min \left\lbrace 1, \frac{q\left[\boldsymbol{x} \,|\, \boldsymbol{x}'\right] \mathbb{p}\left[\mathbf{x} = \boldsymbol{x}'\right]} {q\left[\boldsymbol{x}' \,|\, \boldsymbol{x}\right] \mathbb{p}\left[\mathbf{x} = \boldsymbol{x}\right]} \left|\frac{\partial \boldsymbol{x}'}{\partial \boldsymbol{x}}\right| \right\rbrace. \end{equation}

The Jacobian term accounts for a possible change in the measure with respect to which the densities are defined. See Green (1995) or Lan et al. (2012).

If proposal density is symmetric

\begin{equation} q\left[\boldsymbol{x}' \,|\, \boldsymbol{x}\right] = q\left[\boldsymbol{x} \,|\, \boldsymbol{x}'\right] \quad\forall \boldsymbol{x},\, \boldsymbol{x}' \in \mathbb{R}^D, \end{equation}

then the acceptance probability reduces to

$$ a \left[ \boldsymbol{x}' \,|\, \boldsymbol{x} \right] = \min \left\lbrace 1, \exp\left[ \phi(\boldsymbol{x}) - \phi(\boldsymbol{x}') \right] \left|\frac{\partial \boldsymbol{x}'}{\partial \boldsymbol{x}}\right| \right\rbrace. $$
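A minimal Python sketch of one such update with a symmetric Gaussian random-walk proposal (so the Jacobian term is unity), assuming a hypothetical function `phi` returning the negative log target density:

```python
import numpy as np

def rw_metropolis_step(x, phi, step_size, rng):
    """One Metropolis step with a symmetric Gaussian random-walk proposal.

    As the proposal is symmetric and has unit Jacobian, the acceptance
    probability reduces to min{1, exp[phi(x) - phi(x')]}.
    phi: hypothetical negative log target density (up to an additive constant).
    """
    x_prop = x + step_size * rng.standard_normal(x.shape)
    if rng.uniform() < np.exp(min(0.0, phi(x) - phi(x_prop))):
        return x_prop, True
    return x, False

# Hypothetical usage with a standard Gaussian target:
# rng = np.random.default_rng(1234)
# x, accepted = rw_metropolis_step(np.zeros(2), lambda x: 0.5 * x @ x, 0.5, rng)
```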

The key problem is finding a proposal density that allows proposing 'large' moves with a high probability of acceptance.

Abstract description of Hamiltonian Monte Carlo

Augment state space

Augment state space with a vector 'momentum' variable $\mathbf{p} \in \mathbb{R}^D$ and a signed 'time direction' variable $\mathsf{d} \in \left\lbrace -1,\,+1 \right\rbrace$

\begin{align} \mathbb{p}\left[\mathbf{p} = \boldsymbol{p} \,|\,\mathbf{x} = \boldsymbol{x}\right] &\propto \exp\left\lbrace - \tau(\boldsymbol{x}, \boldsymbol{p}) \right\rbrace, \\ \mathbb{P}\left[\mathsf{d} = d\,|\,\mathbf{x} = \boldsymbol{x},\,\mathbf{p} = \boldsymbol{p}\right] &= \frac{1}{2}, \end{align}
\begin{equation} \therefore \quad \mathbb{p}\left[\mathbf{x} = \boldsymbol{x},\,\mathbf{p}=\boldsymbol{p},\,\mathsf{d}=d\right] \propto \exp\underbrace{\left\lbrace -\phi(\boldsymbol{x}) - \tau(\boldsymbol{x}, \boldsymbol{p})\right\rbrace}_{-H(\boldsymbol{x}, \boldsymbol{p})}. \end{equation}

Hamiltonian dynamic in augmented state space

If $\mathbf{S}$ is a $2D\times 2D$ constant non-singular skew-symmetric matrix then we can define a Hamiltonian dynamic on the joint system $\boldsymbol{z} = \left[ \boldsymbol{x};\, \boldsymbol{p} \right]$ by

\begin{equation} \frac{\mathrm{d}\boldsymbol{z}}{\mathrm{d} t} = d \, \mathbf{S} \frac{\partial H}{\partial\boldsymbol{z}} ~~\Leftrightarrow~~ \left[ \begin{array}{c} \frac{\mathrm{d}\boldsymbol{x}}{\mathrm{d}t} \\ \frac{\mathrm{d}\boldsymbol{p}}{\mathrm{d}t} \end{array} \right] = d \, \mathbf{S} \left[ \begin{array}{c} \frac{\partial H}{\partial\boldsymbol{x}} \\ \frac{\partial H}{\partial\boldsymbol{p}} \end{array} \right]. \end{equation}

A flow map $\Psi_{T,d}\left\lbrace \boldsymbol{z}_0 \right\rbrace = \boldsymbol{z}(T)$ is then defined by the solution $\boldsymbol{z}(T)$ to the initial value problem

\begin{equation} \frac{\mathrm{d}\boldsymbol{z}}{\mathrm{d}t} = d \, \mathbf{S} \frac{\partial H}{\partial\boldsymbol{z}}, \quad \boldsymbol{z}(0) = \boldsymbol{z}_0, \quad t \in [0, T]. \end{equation}

Typically $\mathbf{S} = \left[ \begin{array}{cc} \mathbf{0} & \mathbf{I} \\ -\mathbf{I} & \mathbf{0} \end{array} \right]$ in which case the dynamic is canonical.
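With this canonical choice of $\mathbf{S}$, the joint system above reduces to the familiar canonical Hamiltonian equations

\begin{equation} \frac{\mathrm{d}\boldsymbol{x}}{\mathrm{d}t} = d \, \frac{\partial H}{\partial\boldsymbol{p}}, \qquad \frac{\mathrm{d}\boldsymbol{p}}{\mathrm{d}t} = -d \, \frac{\partial H}{\partial\boldsymbol{x}}. \end{equation}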

Properties of dynamic

  • Exactly conserves the Hamiltonian $H(\boldsymbol{z})$

    \begin{equation} \frac{\mathrm{d}H}{\mathrm{d}t} = \frac{\partial H}{\partial\boldsymbol{z}}^{\mathrm{T}} \frac{\mathrm{d}\boldsymbol{z}}{\mathrm{d}t} = d \frac{\partial H}{\partial\boldsymbol{z}}^{\mathrm{T}} \mathbf{S} \frac{\partial H}{\partial\boldsymbol{z}} = 0 \end{equation}
  • Reversible under negation of $\mathsf{d}$

    \begin{equation} \text{If } \boldsymbol{z}' = \Psi_{T,+1} \left\lbrace \boldsymbol{z} \right\rbrace \text{ then } \boldsymbol{z} = \Psi_{T,-1} \left\lbrace \boldsymbol{z}' \right\rbrace. \end{equation}
  • Preserves volume as flow is divergence-free

    \begin{equation} \left( \frac{\partial }{\partial\boldsymbol{z}} \right)^{\mathrm{T}} \frac{\mathrm{d}\boldsymbol{z}}{\mathrm{d}t} = d \, \mathrm{Tr} \left[ \mathbf{S} \frac{\partial^2 H}{\partial \boldsymbol{z}\,\partial\boldsymbol{z}^{\mathrm{T}}} \right] = 0 \Rightarrow \left| \frac{\partial \Psi_{T,d}}{\partial\boldsymbol{z}} \right| = 1. \end{equation}

The dynamic also has the further property of being a symplectic map with respect to the structure matrix $\mathbf{S}$

\begin{equation} \frac{\partial \Psi_{T,d}}{\partial\boldsymbol{z}}^{\mathrm{T}} \mathbf{S}^{-1} \frac{\partial \Psi_{T,d}}{\partial\boldsymbol{z}} = \mathbf{S}^{-1}, \end{equation}

see for example Leimkuhler and Reich (2005) for proof and more details.

Symplecticness implies volume preservation but is a more stringent requirement for $D > 1$.

Exact Hamiltonian Monte Carlo

If we therefore define a proposal density

\begin{equation} q\left[\boldsymbol{z}', d' \,|\, \boldsymbol{z}, d\right] = \delta \left[ \Psi_{T,d}\left\lbrace \boldsymbol{z} \right\rbrace - \boldsymbol{z}' \right] \delta \left[ d - (-d') \right], \end{equation}

i.e. propose a new state by deterministically running the Hamiltonian dynamics forward $T$ units of time and then reversing the time direction; the acceptance probability is then unity

\begin{equation} \min \left\lbrace 1, \exp\left[ H(\boldsymbol{z}) - H(\boldsymbol{z}') \right] \left|\frac{\partial \boldsymbol{z}'}{\partial\boldsymbol{z}}\right| \right\rbrace = 1. \end{equation}

We can then deterministically flip $\mathsf{d}$ (so on the next proposal we won't go back to our previous point) as

$$ \mathbb{P}\left[\mathsf{d}=\pm1\,|\,\mathbf{x}=\boldsymbol{x},\,\mathbf{p}=\boldsymbol{p}\right] = \frac{1}{2}, $$

and so this move also leaves the joint density invariant.

However, this composition of transitions will not be ergodic in the joint state space, as we remain confined to the same constant-Hamiltonian manifold.

Concrete implementation

Simulating Hamiltonian dynamics in practice

  • In reality for most systems of interest we cannot compute the flow map $\Psi_{T,d}$ exactly and so have to resort to discretisation and numerical integration.
  • Importantly, there are numerical integration schemes which define an approximate flow map $\tilde{\Psi}_{T,d}$ that retains the volume-preservation and reversibility properties of the exact dynamic $\Psi_{T,d}$.
  • In general the Hamiltonian is no longer exactly conserved under discretisation, so there will be some rejections.

There is a class of integrators which also preserve the symplectic map property of the exact dynamic.

Symplectic integrators have a further useful property:

  • If the discretised dynamic is stable, they exactly integrate the dynamic of some 'nearby' Hamiltonian.
  • This nearby Hamiltonian is guaranteed to remain within a fixed distance (depending on the discretisation time step $\delta t$) of the original Hamiltonian.
  • Therefore the dynamic can be integrated over long time periods with a high probability of acceptance.

Euclidean Manifold HMC [1]

The standard (and original) implementation of HMC augments the system with variables which are independent of the original state and have a Gaussian conditional / marginal

$$ \mathbb{p}\left[\mathbf{p}=\boldsymbol{p}\,|\,\mathbf{x}=\boldsymbol{x}\right] = \mathbb{p}\left[\mathbf{p}=\boldsymbol{p}\right] \propto \exp \underbrace{\left\lbrace - \frac{1}{2}\boldsymbol{p}^{\mathrm{T}}\mathbf{M}^{-1}\boldsymbol{p}\right\rbrace}_{-\tau(\boldsymbol{p})}. $$
1: Named for consistency with Riemannian Manifold HMC. Naming scheme borrowed from Betancourt (2012).

The derivative $\frac{\partial \tau}{\partial\boldsymbol{p}} = \mathbf{M}^{-1}\boldsymbol{p}$ is now just a linear transform of the $\boldsymbol{p}$ variables, which, by analogy with Newtonian mechanics, can be interpreted as momentum variables with mass matrix $\mathbf{M}$

\begin{equation} \mathbf{M} \frac{\mathrm{d}^2\boldsymbol{x}}{\mathrm{d}t^2} = \boldsymbol{f}(\boldsymbol{x}) ~\Leftrightarrow~ \frac{\mathrm{d}\boldsymbol{x}}{\mathrm{d}t} = \mathbf{M}^{-1} \boldsymbol{p} ~~~ \frac{\mathrm{d}\boldsymbol{p}}{\mathrm{d}t} = \boldsymbol{f}(\boldsymbol{x}) = -\frac{\partial \phi}{\partial\boldsymbol{x}}. \end{equation}
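A minimal Python sketch of these quantities, assuming a hypothetical function `phi` returning the negative log target density and a precomputed inverse mass matrix `M_inv` as a NumPy array:

```python
import numpy as np

def kinetic_energy(p, M_inv):
    """Gaussian kinetic energy tau(p) = 0.5 * p^T M^{-1} p."""
    return 0.5 * p @ M_inv @ p

def grad_kinetic_energy(p, M_inv):
    """d tau / d p = M^{-1} p, the 'velocity' dx/dt in the Newtonian analogy."""
    return M_inv @ p

def hamiltonian(x, p, phi, M_inv):
    """H(x, p) = phi(x) + tau(p) for the separable Euclidean-metric case."""
    return phi(x) + kinetic_energy(p, M_inv)
```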

In this case, as the distribution on $\mathbf{p}$ is symmetric, there is no need to add a further binary direction variable $\mathsf{d}$: reversibility can instead be achieved by negating the momentum variables

\begin{equation} \tau(\boldsymbol{p}) = \tau(-\boldsymbol{p}) = \frac{1}{2}\boldsymbol{p}^{\mathrm{T}}\mathbf{M}^{-1}\boldsymbol{p}. \end{equation}

Further as $\mathbb{p}\left[\mathbf{p}=\boldsymbol{p}\,|\,\mathbf{x}=\boldsymbol{x}\right]$ is Gaussian we can easily resample the momentum variables between dynamic proposal updates to alter the energy of the system and ensure ergodicity.

Leapfrog updates

For separable Hamiltonians (with no terms coupling $\boldsymbol{x}$ and $\boldsymbol{p}$) there is a particularly efficient symplectic integration scheme called the leapfrog (or Störmer-Verlet) method, composed of two step types

\begin{align} \Phi^{A}_{\delta t} \left[ \begin{array}{c} \boldsymbol{x} \\ \boldsymbol{p} \end{array} \right] &= \left[ \begin{array}{c} \boldsymbol{x} \\ \boldsymbol{p} - \delta t\, \frac{\partial \phi}{\partial\boldsymbol{x}} \end{array} \right], \\[2mm] \Phi^{B}_{\delta t} \left[ \begin{array}{c} \boldsymbol{x} \\ \boldsymbol{p} \end{array} \right] &= \left[ \begin{array}{c} \boldsymbol{x} + \delta t\, \mathbf{M}^{-1} \boldsymbol{p} \\ \boldsymbol{p} \\ \end{array} \right]. \end{align}
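A minimal Python sketch of the two component maps, assuming a hypothetical `grad_phi` function returning $\frac{\partial \phi}{\partial\boldsymbol{x}}$ and `M_inv` the inverse mass matrix:

```python
def step_A(x, p, dt, grad_phi):
    """Phi^A_dt: kick the momentum, p <- p - dt * dphi/dx, leaving x fixed."""
    return x, p - dt * grad_phi(x)

def step_B(x, p, dt, M_inv):
    """Phi^B_dt: drift the position, x <- x + dt * M^{-1} p, leaving p fixed."""
    return x + dt * M_inv @ p, p
```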

Individually each of these steps is volume preserving

\begin{align} \left| \frac{\partial \Phi^A_{\delta t}}{\partial\boldsymbol{z}} \right| &= \left| \begin{array}{cc} \mathbf{I} & \mathbf{0} \\ \dots & \mathbf{I} \end{array} \right| = 1, \\[2mm] \left| \frac{\partial \Phi^B_{\delta t}}{\partial\boldsymbol{z}} \right| &= \left| \begin{array}{cc} \mathbf{I} & \dots \\ \mathbf{0} & \mathbf{I} \end{array} \right| = 1. \end{align}

Further any composition of the steps is also volume preserving.

In particular a symmetric composition of the form

\begin{equation} \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t} \end{equation}

is also time reversible and symplectic.

Overall this gives a single step of leapfrog dynamics as

\begin{align} \boldsymbol{p}^{\left(n + \frac{1}{2}\right)} &= \boldsymbol{p}^{(n)} - \frac{1}{2} \delta t \left. \frac{\partial \phi}{\partial\boldsymbol{x}} \right|_{\,\boldsymbol{x}^{(n)}}\\[2mm] \boldsymbol{x}^{(n+1)} &= \boldsymbol{x}^{(n)} + \delta t \, \mathbf{M}^{-1} \boldsymbol{p}^{\left(n + \frac{1}{2}\right)} \\[2mm] \boldsymbol{p}^{(n+1)} &= \boldsymbol{p}^{\left(n + \frac{1}{2}\right)} - \frac{1}{2} \delta t \left. \frac{\partial \phi}{\partial\boldsymbol{x}} \right|_{\,\boldsymbol{x}^{(n+1)}}\\ \end{align}

In practice we tend to combine the half steps after the initial one

\begin{equation} \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{B}_{\delta t} \circ \left\lbrace \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t} \right\rbrace \circ \Phi^{B}_{\delta t} \circ \dots \circ \left\lbrace \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t} \right\rbrace \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t} \end{equation}

\begin{equation} = \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\delta t} \circ \Phi^{B}_{\delta t} \dots \Phi^{A}_{\delta t} \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t}. \end{equation}
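A sketch of the resulting integrator with the interior half steps merged, reusing the hypothetical `grad_phi` and `M_inv` from above:

```python
def leapfrog_integrate(x, p, dt, n_step, grad_phi, M_inv):
    """Compose Phi^A_{dt/2} o Phi^B_dt o Phi^A_dt o ... o Phi^B_dt o Phi^A_{dt/2}.

    The interior pairs of half momentum steps are combined into single full steps.
    """
    p = p - 0.5 * dt * grad_phi(x)      # initial half momentum step
    for i in range(n_step):
        x = x + dt * M_inv @ p          # full position step
        if i < n_step - 1:
            p = p - dt * grad_phi(x)    # merged full momentum step
    p = p - 0.5 * dt * grad_phi(x)      # final half momentum step
    return x, p
```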

Resampling momenta to ensure ergodicity

Metropolis-Hastings updates with Hamiltonian dynamics proposals alone will not generally ensure ergodicity, as the chain remains constrained to a (near) constant Hamiltonian surface.

Overcome by alternating with a different Markov transition operator which leaves joint distribution invariant.

In particular we can use any transition which leaves the conditional on the momenta given the positions invariant (cf. Gibbs sampling). In the case of standard HMC the momenta are independent of the positions, so we resample them independently from their Gaussian distribution

\begin{equation} \boldsymbol{p} \sim \mathcal{N}\left(\cdot; \boldsymbol{0}, \mathbf{M}\right). \end{equation}
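Combining the momentum resampling, the leapfrog proposal and the Metropolis accept/reject step gives a complete transition. A minimal sketch, assuming hypothetical `phi` and `grad_phi` functions and, for simplicity, an identity mass matrix $\mathbf{M} = \mathbf{I}$:

```python
import numpy as np

def hmc_transition(x, phi, grad_phi, dt, n_step, rng):
    """One standard HMC transition with identity mass matrix.

    Resamples p ~ N(0, I), integrates the leapfrog dynamic for n_step steps
    and accepts or rejects the proposed joint state with a Metropolis test.
    """
    p = rng.standard_normal(x.shape)
    h_init = phi(x) + 0.5 * p @ p

    # Leapfrog integration (M = I so M^{-1} p = p).
    x_new, p_new = x, p
    p_new = p_new - 0.5 * dt * grad_phi(x_new)
    for i in range(n_step):
        x_new = x_new + dt * p_new
        if i < n_step - 1:
            p_new = p_new - dt * grad_phi(x_new)
    p_new = p_new - 0.5 * dt * grad_phi(x_new)

    h_prop = phi(x_new) + 0.5 * p_new @ p_new
    # Accept with probability min{1, exp[H(z) - H(z')]}.
    if rng.uniform() < np.exp(min(0.0, h_init - h_prop)):
        return x_new
    return x
```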

A more general update with partial momentum refreshment from Horowitz (1991), parameterised by $\theta \in \left[0,\,\frac{\pi}{2}\right]$, is

\begin{equation} \boldsymbol{p}' = \cos(\theta)\, \boldsymbol{n} + \sin(\theta)\, \boldsymbol{p} \quad \text{with } \boldsymbol{n} \sim \mathcal{N}\left(\cdot; \boldsymbol{0}, \mathbf{M}\right). \end{equation}
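A minimal sketch of this update, assuming a hypothetical `chol_M` argument holding a Cholesky factor $\mathbf{L}$ of the mass matrix so that $\boldsymbol{n} = \mathbf{L}\boldsymbol{\eta}$ with $\boldsymbol{\eta} \sim \mathcal{N}(\boldsymbol{0}, \mathbf{I})$ has covariance $\mathbf{M}$:

```python
import numpy as np

def partial_momentum_refresh(p, theta, chol_M, rng):
    """Partial momentum refreshment p' = cos(theta) * n + sin(theta) * p.

    n ~ N(0, M) is drawn via the Cholesky factor chol_M. With the parameterisation
    above, theta = 0 gives a full independent resample of the momentum while
    theta = pi/2 leaves it unchanged.
    """
    n = chol_M @ rng.standard_normal(p.shape)
    return np.cos(theta) * n + np.sin(theta) * p
```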

References and further reading