## Saturday, September 24, 2016

### Probabilistic Programming - Part 1

Probabilistic programming is an exciting, new-ish area that is eating a lot of my spare time.

"Probabilistic Programming and Bayesian Methods for Hackers" is an excellent book I am playing with. Not only is it free (see here), you can download it and read it as an IPython book (with the command ipython notebook) and tinker with the equations (use shift-enter to execute them).

As you might have gathered, it's Python-based but there are moves to integrate the library it uses extensively (PyMC) with Spark.

In the Java/Scala world, Figaro is making in-roads. In this first part post, I'm making notes on the Python implementation. In part two, I'm hoping to implement a Scala/Figaro version.

Bayes again

Bayesian probability requires a different way of thinking.

If frequentist and Bayesian inference were programming functions, with inputs being statistical problems, then the two would be different in what they return to the user. The frequentist inference function would return a number, representing an estimate (typically a summary statistic like the sample average etc.), whereas the Bayesian function would return probabilities
For example, in our debugging problem above, calling the frequentist function with the argument "My code passed all tests; is my code bug-free?" would return a YES. On the other hand, asking our Bayesian function "Often my code has bugs. My code passed all tests; is my code bug-free?" would return something very different: probabilities of YES and NO. The function might return:
YES, with probability 0.8; NO, with probability 0.2
This is very different from the answer the frequentist function returned. Notice that the Bayesian function accepted an additional argument:  "Often my code has bugs". This parameter is the prior. By including the prior parameter, we are telling the Bayesian function to include our belief about the situation.
The formula for Bayesian probability goes like this:

P(A|X) = P(X|A) P(A)
P(X)

Which reads in English as: the probability of A given X is equal to the probability of X give A multiplied by the probability of A by itself and divided by the probability of X by itself.

P(A|X)is called our posterior and P(A) is called our prior.

Probability distributions refresher

Probability distributions can be discrete (integers) or continuous (real numbers). The distribution of a discrete variable is called a probability mass function, a continuous variable has a probability density function.

One (of many) discrete functions is the Poisson distribution that dates back centuries and was derived to estimate the probabilities of cavalry horses losing a shoe in (or some other independent event). It looks like:

P(Z = k) = λk e
k!

and this has the convenient property that its expected value is λ (or, in maths speak, E[Z|λ]λ).

So, this looks as good a distribution as any for our prior. The trouble is, what is λ? Aye, there's the rub. λ can be anything (an integer or any real number), so to model it, we can choose an exponential distribution. It is continuous and it has some convenience. The probability density function looks like:

f(z|λ) = λ e-λz ,  z >= 0

and the convenient fact is that E[Z|λ] = λ-1.

PyMC

The task is to deduce when (if at all) somebody's behavior changed given a list of the number of text messages they sent over a certain time period. To do this, the book uses PyMC, a Python library.

The workhorse of the code is the generation of simulated data:

@pm.deterministic
def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
out = np.zeros(n_count_data)
out[:tau] = lambda_1  # lambda before tau is lambda1
out[tau:] = lambda_2  # lambda after (and including) tau is lambda2
return out

which simply creates an array where the elements are one of two values. All the values before tau are lambda_1 and the rest are lambda_2. Simples.

"Inside the deterministic decorator [@pm.deterministic], the Stochastic variables passed in behave like scalars or Numpy arrays (if multivariable), and not like Stochastic variables."

By putting some print statements in, you can see that this function gets hit thousands of times when we feed it into the Poisson function that is given to a Model function that is passed to a MCMC function (Markov Chain Monte Carlo). Phew! The details can be found in the freely downloadable book and I'll explain more when I show the Figaro implementation.

"MCMC returns samples from the posterior distribution, not the distribution itself... MCMC performs a task similar to repeatedly asking "How likely is this pebble I found to be from the mountain I am searching for?", and completes its task by returning thousands of accepted pebbles in hopes of reconstructing the original mountain. In MCMC and PyMC lingo, the returned sequence of "pebbles" are the samples, cumulatively called the traces."

And it's these traces that (when graphed) can help us to see when it is most likely that behavior changed. I'll go into more detail with the Scala implementation in part 2.