Sunday, April 9, 2017

Generating a Gaussian distribution


Firstly, let's consider what a Gaussian is. It can be derived here. The definition Dr Wilson gives is: "Data are said to be normally distributed if the rate at which the frequencies fall off is proportional to the distance of the score from the mean, and to the frequencies themselves". Or, in maths speak:

df = k (x - μ) f(x)
dx

where k is a constant and μ is the mean. Integrating this formula and setting it to 1 (since it's a probability distribution) via integration by parts and integration by substitution, you'll eventually get:

f(x) =   e (x - μ)2/2 σ 2 
       σ √2π

which is the Gaussian, where σ is the standard deviation.

Now, squaring it and integrating it twice still equals 1 (since obviously 12 = 1) and looks like this:

(1/2π) ∫-∞-∞  e-x2/2 e-y2/2  dx dy

where I've absorbed the constants into variables and used both x and y for each integration. The trick is now to use polar co-ordinates, so:

x     = r cos θ
y     = r sin θ
dx dy = r dr dθ

and that integral becomes:

(1/2π) ∫0  ∫0  r e-r2/2  dr

which, with another integration by substitution with ψ = r2/2, becomes:

(1/2π) ∫0  ∫0  e  dψ

and again with Υ = e.

(1/2π) ∫0  ∫10  Υ dΥ = (1/2π)2π x 1 = 1

Where Transformation Sampling comes in

"We don't care about this integral. What we do is we use our relationship between integration and sampling" [Werner Krauth at Coursera] where θ is a random number between 0 and  and Υ is a random number between 0 and 1.

So, now we work backwards. If ψ = r2/2 and Υ = e, then r =  √ (- 2 ln Υ) where Υ is sampled over the interval [0,1]. The easy bit is θ which is simply sampled in the interval [0, ].

Plugging these numbers back into the equations for polar co-ordinates, we can now give samples for x and y. Notice that you actually get 2 values (x and y) for the price of one. So, implementations (like java.lang.Random.nextGaussian) maintain state (the next Gaussian value if it's available) which is calculated for "free" - except that the method is synchronized...

Implementation

What we have above is a decent implementation for generating Gaussian values but it's inefficient. There is a nice math-hack to make it better.

Java Number Cruncher (p367) says:
We begin this algorithm by generating two random values v1 and v2 that are uniformly distributed between -1 and +1. We then compute the value s = v12 + v22. If s>1, we reject the values of v1 and v2 generate new ones... With two good values of v1 and v2, we then compute:
x1 = v1 √((-2 ln s) / s)
x2 = v2 √((-2 ln s) / s)
Where this comes from had me scuttling off to here which describes the Box Muller method. Basically, we want a number between 0 and 1 for Υ and any value in [0, ] for θ. Well, this is just finding uniformly distributed values within the unit circle. Since the process of scattering points uniformly over a box in no way favours nor discriminates between points within a circle within that box, we can be sure that just taking those inner points is also uniformly distributed.

What we want to do is get rid of those expensive sin and cos functions for x and y. We note that

cos θ = adjacent / hypotenuse = v1 / s
sin θ = opposite / hypotenuse = v2 / s

since s2 = r is the length of our vector (as defined in the above algorithm from Java Number Crunchers). We then plug that into our polar co-ordinates equation above, we get the equations for x1 and x2 we see above. QED.

No comments:

Post a Comment