To produce random samples from some distribution over the reals one first seeks the integral of the PDF (probability density function), in our case e^(−x²) up to normalization, which is monotonically increasing and goes from 0 to 1 as x runs from −∞ to ∞. This is called the Cumulative Distribution Function (CDF). The inverse of the CDF, fed a real chosen from U(0, 1), the uniform distribution on the unit interval, produces the desired sample. For the Gaussian, however, that CDF (essentially erf) and its inverse have no known simple formulation, and so we try something different, known as Marsaglia's polar method.
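As a concrete illustration of inverse-CDF sampling (for a distribution where it does work), here is a minimal Python sketch for the exponential distribution, whose CDF 1 − e^(−x) inverts in closed form; the function name is mine, not anything from the text:

  import math
  import random

  def sample_exponential():
      # The exponential CDF is c(x) = 1 - e^(-x) for x >= 0.
      # Solving c(x) = u for u drawn from U(0, 1) gives x = -log(1 - u).
      # No such closed-form inverse exists for the Gaussian CDF (erf).
      u = random.random()
      return -math.log(1.0 - u)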
While ∫[t=−∞:x] e^(−t²) dt has no simple representation, ∫[t=−∞:x] t·e^(−t²) dt does. This is easiest to see by differentiating e^(−x²) to get −2x·e^(−x²). Thus we have ∫[x=0:∞] x·e^(−x²) dx = ½. We will use this, but meanwhile...
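That value is easy to check numerically; a quick Riemann-sum sketch in Python (the truncation point 10 is an assumption, chosen because the integrand is negligible beyond it):

  import math

  # Riemann sum for the integral of x*e^(-x^2) over [0, 10],
  # which approximates the integral over [0, infinity).
  dx = 1e-4
  total = sum(i * dx * math.exp(-(i * dx) ** 2) * dx
              for i in range(int(10 / dx)))
  print(total)   # approximately 0.5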
The functional equation f(√(x² + y²)) = f(x)·f(y) is solved by f(x) = e^(−x²). With this we can consider a Gaussian distribution in the 2D plane with probability density e^(−(x² + y²)) = e^(−r²) = e^(−x²)·e^(−y²). We consider the integral of that function over the entire plane. This can be expressed either as the square of the one-dimensional integral, (∫[x=−∞:∞] e^(−x²) dx)², or, in polar coördinates, as ∫[r=0:∞] 2πr·e^(−r²) dr = π.
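Both forms can be checked numerically against π; a sketch with the same crude Riemann sums (the bound 10 again an assumption):

  import math

  dx = 1e-3
  # Cartesian: square the one-dimensional integral of e^(-x^2).
  one_d = sum(math.exp(-(i * dx) ** 2) * dx for i in range(-10000, 10000))
  # Polar: integrate 2*pi*r*e^(-r^2) for r from 0 to infinity.
  polar = sum(2 * math.pi * i * dx * math.exp(-(i * dx) ** 2) * dx
              for i in range(10000))
  print(one_d ** 2, polar, math.pi)   # all approximately 3.14159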
Now we consider the algorithm to produce random normal deviates. We choose a sample from the plane with the aforementioned probability distribution and deliver each of its two coördinates as separate and independent samples of the normal distribution. First we choose a point in [−1, 1]², but discard it unless it is also in the inscribed circle. This gives a point whose θ is uniform in [0, 2π) and whose radius has probability density proportional to r on [0, 1]. We know the radius, having computed r² = s = x² + y². Can we find a function g that maps [0, 1] to [0, ∞) so that the point (x·g(s), y·g(s)) serves as the sample from our 2D distribution?
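A sketch of that rejection step in Python (point_in_disc is my name for it; rejecting s = 0 as well is a small practical guard so that the logarithm used later stays finite):

  import random

  def point_in_disc():
      # Choose (x, y) uniformly in [-1, 1]^2; keep it only if it falls
      # strictly inside the inscribed unit circle (and is not the origin).
      # Returns x, y and s = x^2 + y^2 = r^2.
      while True:
          x = random.uniform(-1.0, 1.0)
          y = random.uniform(-1.0, 1.0)
          s = x * x + y * y
          if 0.0 < s < 1.0:
              return x, y, s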
Terminology: If a sample v is taken from a distribution whose PDF is f, then the probability that x ≤ v ≤ x+dx is f(x)dx. Here x and v range over the reals and ∫[−∞:∞] f(x) dx = 1. (For reference, λx(π^(−½)·e^(−x²)) is the PDF of a normal distribution, the one with variance ½.) The CDF of a distribution is the function c such that a sample v from the distribution is less than x with probability c(x).
The PDF for r is λr(2r) on [0, 1] and λr(0) outside that range. The CDF for r is the integral of its PDF: λr(r²) (and λr(1) for r > 1). The CDF for s = r² is λs(if s≤0 then 0 else if s≤1 then s else 1), since the probability that r² ≤ s is the probability that r ≤ √s, namely (√s)² = s. The PDF for s is its derivative: λs(if 0≤s≤1 then 1 else 0), which is to say s is uniform on [0, 1]. If you find this confusing and surprising, it confuses me too and warrants the corroboration below.
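Empirically: draw many points as above and bucket the values of s; each tenth of [0, 1] should catch about a tenth of the samples. A sketch reusing the point_in_disc of the earlier fragment:

  # Ten bins over [0, 1]; each should end up near 0.1 if s is uniform.
  n = 100000
  bins = [0] * 10
  for _ in range(n):
      _, _, s = point_in_disc()
      bins[min(int(s * 10), 9)] += 1
  print([round(b / n, 3) for b in bins])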
On [0, ∞) the PDF λr(2r·e^(−r²)) has CDF λr(1 − e^(−r²)). To choose a sample from this distribution we can solve for r in e^(−r²) = u where u is from the uniform distribution on [0, 1] (if u is uniform on [0, 1], so is 1 − u). As just shown, s is uniformly distributed on [0, 1], so we use s as u. This gives r = √(−log(s)). The length of the vector (x, y) is √s, and to get the vector of length r we must multiply (x, y) by r/√s = √(−log(s))/√s = √(−log(s)/s). The final two independent random samples from the normal distribution are x·m and y·m where m = √(−log(s)/s). The pesky factor √2 comes from the fact that we started with e^(−x²), whose variance is ½ instead of the standard 1. Use m = √(−2·log(s)/s) to get samples with variance 1, as from the standard distribution.
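Putting the pieces together, a sketch of the whole algorithm (the name gaussian_pair is mine), returning two independent samples of variance 1:

  import math
  import random

  def gaussian_pair():
      # Marsaglia's polar method.  Rejection-sample (x, y) in the unit
      # disc, set s = x^2 + y^2, and scale by m = sqrt(-2*log(s)/s);
      # the factor 2 corrects the variance from 1/2 to the standard 1.
      while True:
          x = random.uniform(-1.0, 1.0)
          y = random.uniform(-1.0, 1.0)
          s = x * x + y * y
          if 0.0 < s < 1.0:
              m = math.sqrt(-2.0 * math.log(s) / s)
              return x * m, y * m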
I find both the pedagogy and precision of the above very poor. I would not believe it without the many tests I have tried on the resulting algorithm. I don’t find the Wikipedia article much better.