Agile Java Man: Generating a Gaussian distribution

Firstly, let's consider what a Gaussian is. It can be derived here. The definition Dr Wilson gives is: "Data are said to be normally distributed if the rate at which the frequencies fall off is proportional to the distance of the score from the mean, and to the frequencies themselves". Or, in maths speak:

df = k (x - μ) f(x)
dx

where k is a constant and μ is the mean. Integrating this formula and setting it to 1 (since it's a probability distribution) via integration by parts and integration by substitution, you'll eventually get:

f(x) = e ^{(x - μ)²/2 σ ²}

σ √2π

which is the Gaussian, where σ is the standard deviation.

Now, squaring it and integrating it twice still equals 1 (since obviously 1² = 1) and looks like this:

(1/2π) ∫^∞_-∞∫^∞_-∞  e^-x²/2 e^-y²/2  dx dy

where I've absorbed the constants into variables and used both x and y for each integration. The trick is now to use polar co-ordinates, so:

x = r cos θ
y = r sin θ
dx dy = r dr dθ

and that integral becomes:

(1/2π) ∫^2π₀ dθ ∫^∞₀  r e^-r²/2  dr

which, with another integration by substitution with ψ = r²/2, becomes:

(1/2π) ∫^2π₀ dθ ∫^∞₀  e^-ψ  dψ

and again with Υ = e^-ψ.

(1/2π) ∫^2π₀ dθ ∫¹₀  Υ dΥ = (1/2π)2π x 1 = 1

Where Transformation Sampling comes in

"We don't care about this integral. What we do is we use our relationship between integration and sampling" [Werner Krauth at Coursera] where θ is a random number between 0 and 2π and Υ is a random number between 0 and 1.

So, now we work backwards. If ψ = r²/2 and Υ = e^-ψ, then r = √ (- 2 ln Υ) where Υ is sampled over the interval [0,1]. The easy bit is θ which is simply sampled in the interval [0, 2π].

Plugging these numbers back into the equations for polar co-ordinates, we can now give samples for x and y. Notice that you actually get 2 values (x and y) for the price of one. So, implementations (like java.lang.Random.nextGaussian) maintain state (the next Gaussian value if it's available) which is calculated for "free" - except that the method is synchronized...

Implementation

What we have above is a decent implementation for generating Gaussian values but it's inefficient. There is a nice math-hack to make it better.

Java Number Cruncher (p367) says:

We begin this algorithm by generating two random values v₁ and v₂ that are uniformly distributed between -1 and +1. We then compute the value s = v₁² + v₂². If s>1, we reject the values of v₁ and v₂ generate new ones... With two good values of v₁ and v₂, we then compute:

x₁ = v₁ √((-2 ln s) / s)
x₂ = v₂ √((-2 ln s) / s)

Where this comes from had me scuttling off to here which describes the Box Muller method. Basically, we want a number between 0 and 1 for Υ and any value in [0, 2π] for θ. Well, this is just finding uniformly distributed values within the unit circle. Since the process of scattering points uniformly over a box in no way favours nor discriminates between points within a circle within that box, we can be sure that just taking those inner points is also uniformly distributed.

What we want to do is get rid of those expensive sin and cos functions for x and y. We note that

cos θ = adjacent / hypotenuse = v₁ / √s
sin θ = opposite / hypotenuse = v₂ / √s

since s² = r is the length of our vector (as defined in the above algorithm from Java Number Crunchers). We then plug that into our polar co-ordinates equation above, we get the equations for x₁ and x₂ we see above. QED.

Agile Java Man

Sunday, April 9, 2017

Generating a Gaussian distribution

No comments:

Post a Comment

Blog Archive

About Me