## Sunday, March 26, 2017

### You have nothing to lose but your (Markov) chains!

I've been playing around with Markov Chain Monte Carlo so decided to take a Coursera course that deals with them (Statistical Mechanics allows you plot configurations of objects that are as probable as if you took a snapshot of a classical Newtonian system).

"Starting from a given initialdistribution, a Markov chain defines a sequence of probability distributions over states of the system. For any Markov chain satisfying certain mathematical conditions, the distribution at successive states converges to a single distribution. This single distribution is called the stationary distribution... Much of the art of designing an MH algorithm is in choosing the proposal distribution." PPP (p343).

"The Metropolis algorithm takes a random walk in a probability space."

"The only difference between Metropolis and Metropolis-Hastings is that the first ones always sample from a symmetric distribution and thus they don't have the Hastings correction ratio." - see here

What is a Markov chain?

"If we think of Xt as a state X at time t and invoke the following dependency condition on each state:

Pr(Xt+1 = xt+1 | Xt = xt, Xt-1 = xt-1, ..., X0 = x0) = Pr(Xt+1 = xt+1 | Xt = xt)

then the stochastic process is known as a Markov chain" (from here)

What is a Monte Carlo simulation?

"The Monte Carlo method is a statistical - almost experimental - approach to computing integrals using random positions, called samples, whose distribution is carefully chosen." [Krauth]

In the two Coursera examples, pi is calculated by:

1. throwing stones into a square and observing the ratio that are inside and outside a circle that just fits in the square (direct sampling).
2. wandering around the space throwing stones and observing how high the pile is for each grid reference (Markov chain sampling). Note, there are interesting restrictions when a stone is tossed over the boundary of the space.

Note that both approaches approximate an integral in multiple dimensions, namely:

∫ 1-11-1 p(x, y) O(x, y) dx dy
_______________________________

∫ 1-1∫ 1-1 p(x, y) dx dy

where p is the probability of being at point (x, y) and O is the observed size of the pile of stones at (x, y).

Detailed balance condition

This is a sufficient condition for Monte Carlo. The argument goes like this:

Imagine moving around a 3x3 grid exploring all the squares. If we are on the square in the top right corner (call it 'a') and we can only move horizontally and vertically then the probabilities for the next move are obviously:

p(a → a) + p(a → b) + p(a → c) = 1                (1)

(where 'b' and 'c' are the square immediately to the left and immediately below 'a')

Similarly, I can only get to 'a' via 'a', 'b' and 'c' so:

πa . 1 = πb p(b → a) + πc p(c → a) + πa p(a → a)   (2)

substitute equation 1 into 2:

πa p(a → b) + πa p(a → c) = πb p(b → a) + πc p(c → a)

There are many ways to satisfy the global balance condition but one way is to say the first term on the left hand side equals the first in the right hand side and the second term on the left hand side equals the second term on the right hand side.

This is the detailed balance condition:

πa p(a → b) = πb p(b → a)
πa p(a → c) = πc p(c → a)

where πx is the probability of being on square x.

If we want to sweep out all the configurations equally, we want:

πa = πb = πc

(This is not true in the Metropolis algorithm. Metropolis et al proposed another function for πx)

This implies p(a → b) =  p(b → a).

What's interesting is that "the Monte Carlo algorithm is nothing but a matrix, the transfer matrix of transition probabilities p(a→b) for moving from the site a to the site b." In this 3x3 game, the matrix is 9x9 (the 9 squares from which we move to 9 squares to which we can move).

The transfer matrix

In Python, it looks like this:

from numpy import dot, matrix, set_printoptions
from numpy.linalg import eig

m = matrix('2  1  0  1  0  0  0  0  0;'
'1  1  1  0  1  0  0  0  0;'
'0  1  2  0  0  1  0  0  0;'
'1  0  0  1  1  0  1  0  0;'
'0  1  0  1  0  1  0  1  0;'
'0  0  1  0  1  1  0  0  1;'
'0  0  0  1  0  0  2  1  0;'
'0  0  0  0  1  0  1  1  1;'
'0  0  0  0  0  1  0  1  2') / 4.0

This conforms to a mathematics 'design pattern'. Chase the problem onto known territory and deal with it there with the tools designed by other people.

Where it gets cool is when we take the eigen-decomposition (a.k.a spectral decomposition) of this matrix. This says that for a square matrix, A, we can express it as:

A = Q  Λ Q-1

where the columns of Q are the eigenvectors and Λ is a diagonal matrix whose values are the eigenvalues (a.k.a the spectrum of A).

In Python, we can find Q and Λ by doing this:

eigenvalues, eigenvectors = eig(m)

And why is this interesting? Because if we keep multiplying the transfer matrix with the vector representing the probabilities of which square we're currently in, it converges to the stationary distribution.

How do we prove this? Well, let's say we multiply A by itself x times:

Ax = (Q  Λ Q-1)x =  Q  Λ Q-1 Q  Λ Q-1 Q  Λ Q-1...  Q  Λ Q-1 Q  Λx Q-1

so, we only have to times the eigenvalues by themselves x times. That makes the calculations more efficient.

What are the eigenvalues?

set_printoptions(precision=2)
set_printoptions(suppress=True)
print eigenvalues

[-0.5   0.    1.    0.75  0.5   0.25 -0.    0.25  0.75]

Note that all but one value is less than 1.0 so if we keep multiplying a vector of these values by itself, all the values other than the third will disappear.

Now, let's look at the eigenvectors (Q):

print eigenvectors

[[ 0.17 -0.41 -0.33  0.58  0.5   0.33  0.22  0.04  0.11]
[-0.33  0.41 -0.33  0.29  0.   -0.17  0.12 -0.52 -0.23]
[ 0.17  0.   -0.33 -0.   -0.5   0.33 -0.34  0.04 -0.57]
[-0.33  0.41 -0.33  0.29 -0.   -0.17 -0.57  0.48  0.34]
[ 0.67 -0.   -0.33  0.   -0.   -0.67  0.   -0.08 -0.  ]
[-0.33 -0.41 -0.33 -0.29  0.   -0.17  0.57  0.48 -0.34]
[ 0.17  0.   -0.33 -0.   -0.5   0.33  0.34  0.04  0.57]
[-0.33 -0.41 -0.33 -0.29  0.   -0.17 -0.12 -0.52  0.23]
[ 0.17  0.41 -0.33 -0.58  0.5   0.33 -0.22  0.04 -0.11]]
[[ 0.17  0.   -0.33 -0.   -0.5   0.33 -0.34  0.04 -0.57]]

and note that the third column (corresponding to the third eigenvalue) is uniform. All other columns will disappear as their corresponding eigvalues, when multiplied by themselves a sufficient number of times, tend to 0. And that's why, after a sufficient number of matrix multiplications, we'll reach equilibrium.

Let's see it in action by multiplying the matrix by itself 10 times:

for i in range(10):
m = dot(m, m)

print m

[[ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
[ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
[ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
[ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
[ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
[ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
[ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
[ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
[ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]]

(Ir)Reducible matrices

We could easily extend our 3x3 box into a 6x6 box and make the transfer matrix such that only the top left hand 3x3 square and bottom right hand 3x3 square were possibly territories. If there were no means to transition from one such territory to the other, this would violate the Markov Chain principle outlined above. If we ever found ourselves in one quadrant, we'd know with 100% certainty that we'd never visited nor ever will visit the other.

"One of the two mathematically rigorous conditions, for a Monte Carlo Markov chain algorithm is that this pulling apart is not possible, that would be irreducible."

One way of looking at this is picturing the matrix as a graph. Usually, we picture graphs as adjacency matrices but if we do it the other way around we'd see this matrix as two disjoint sets of vertices. In linear algebra jargon, the matrix is reducible into these two sets. In graph theory jargon, we'd say the graph was not strongly connected. They are equivalent concepts.