## Friday, November 11, 2016

### Conjugate Priors in Bayesian Probability

Apropos the recent election, I was reading Allen Downey's blog about Nate Silver and Bayesian forecasting. This piqued my interest in probability distributions (that I forgot about a quarter century ago). Here are some notes.

Binomial Distribution

This is a distribution that describes a sequence of n independent yes/no experiments each with a probability of p. It is a discrete since you obviously can't have, say, 3.25 successes.

The formula for the binomial probability mass function (PMF) is:

nCk pk (1 - p)n-k

where

n is the number of trials
k is the number of successes
p is the probability of success.

Beta Distribution

"The beta distribution is defined on the interval from 0 to 1 (including both), so it is a natural choice for describing proportions and probabilities.". It is (obviously) a continuous function.

The probability density function (PDF) for the beta distribution is:

xα - 1(1 - x)β - 1
________________

B(α, β)

where α and β are hyperparameters and B(α, β) is:

Γ(α) Γ(β)
_________

Γ(α + β)

where Γ(n) is the gamma function and it's simply defined as:

Γ(n) = (n - 1)!

OK, that sounds a lot more complicated than it is but any decent coder could write all this up in a few lines of their favourite language.

One last thing, the sum of the hyperparameters is the total number of trials. That is

α + β = n

Bayes: Binomial and Beta distribution

And this is where it gets interesting. Recall that the posterior distribution is proportional to the likelihood function and the prior. Or, in maths speak:

P(A | X) ∝ P(X | A) P (A)

"It turns out that if you do a Bayesian update with a binomial likelihood function ... the beta distribution is a conjugate prior. That means that if the prior distribution for x is a beta distribution, the posterior is also a beta distribution." [1]

How do we know this? Well take the equation for the binomial's PMF and multiply it by the beta's PDF. But, to make things easier, note that nCk and B(α, β) are constants and can be ignored in a "proportional to" relationship. So, it becomes:

P(a | x) ∝ xk (1 - x)n-k xα - 1(1 - x)β - 1

which simplifies to:

P(a | x) ∝ xα+k-1 (1 - x)n-k+β - 1

Which is another beta distribution! Only this time

α' = α + k - 1

and

β' = n - k + β - 1

Proofs for other combinations can be found here, more information here, a better explanation by somebody smarter than myself is here, and possibly the most informative answer on StackExchange ever here.

[1] Think Bayes, Downey