To produce random samples from some distribution over the reals, one first seeks the integral of the PDF (probability density function), e^{−x^{2}} in our case; that integral is monotonically increasing and goes from 0 to 1 as its argument runs from −∞ to ∞.
This is called the Cumulative Distribution Function (CDF).
The inverse of the CDF, fed a real chosen from U(0, 1), the uniform distribution on the unit interval, will produce the desired sample.
That function, `erf`, and its inverse have no known simple closed form, however, so we try something different, known as Marsaglia's polar method.

While ∫[t=−∞:x]e^{−t^{2}}dt has no simple representation, ∫[t=−∞:x]te^{−t^{2}}dt does.
This was probably first noticed upon differentiating e^{−x^{2}} to get −2xe^{−x^{2}}.
Thus we have ∫[x=0:∞]xe^{−x^{2}}dx = ½.
We will use this but meanwhile...

The functional equation f(√(x^{2} + y^{2})) = f(x)f(y) is solved by f(x) = e^{−x^{2}}.
With this we can consider a Gaussian distribution in the 2D plane with probability density e^{−(x^{2} + y^{2})} = e^{−r^{2}} = (e^{−x^{2}})(e^{−y^{2}}).
We consider the integral of that function over the entire plane.
This can be expressed either as:

- Cartesian coördinates: ∫[x=−∞:∞]∫[y=−∞:∞](e^{−x^{2}})(e^{−y^{2}})dx dy = (∫[x=−∞:∞]e^{−x^{2}}dx)(∫[y=−∞:∞]e^{−y^{2}}dy) = (∫[x=−∞:∞]e^{−x^{2}}dx)^{2}.
- Polar coördinates: ∫[r=0:∞]∫[θ=0:2π]re^{−r^{2}}dθ dr = (∫[θ=0:2π]dθ)(∫[r=0:∞]re^{−r^{2}}dr) = 2π∫[r=0:∞]re^{−r^{2}}dr = π.

Equating the two gives ∫[x=−∞:∞]e^{−x^{2}}dx = √π.

Now we consider the algorithm to produce random normal deviates.
We choose a sample from the plane with the aforementioned probability distribution and deliver each of its two coördinates as separate and independent samples of the normal distribution.
First we choose a point in [−1, 1]^{2} but discard it unless it is also in the inscribed circle.
This gives a point whose θ is uniform in [0:2π] and whose radius probability density is proportional to r in [0, 1].
We know the radius by having computed r^{2} = s = x^{2} + y^{2}.
Can we find a function g that maps [0, 1] to [0, ∞] so that the point (xg(s), yg(s)) serves as the sample from our 2D distribution?

Terminology:
If a sample v is taken from a distribution whose PDF is f then the probability that x≤v≤x+dx is f(x)dx.
Here x and v range over the reals and ∫[−∞:∞]f(x)dx = 1.
(For reference, λx(π^{−½}e^{−x^{2}}) is the PDF of a normal distribution.)
The CDF of a distribution is the function c such that a sample v from the distribution has probability c(x) of being less than x.

The PDF for r is λr(2r) in [0:1] and λr(0) outside that range.
The CDF for r is the integral of its PDF: λr(r^{2}) (and λr(1) for r>1).
The CDF for s = r^{2} is λs(if s≤0 then 0 else if s≤1 then s else 1).
The PDF for s is its derivative: λs(if s≤0 then 0 else if s≤1 then 1 else 0); that is, s is uniform on [0, 1].
If you find this confusing or surprising, it surprised me too, and it warrants corroboration.

On [0, ∞] the distribution with PDF λr(2re^{−r^{2}}) has CDF λr(1 − e^{−r^{2}}).
To choose a sample from this distribution we can solve for r in e^{−r^{2}} = u where u is from the uniform distribution on [0, 1] (if u is uniform then so is 1 − u).
It turns out that s is uniformly distributed on [0, 1] so we use s as u.
This gives r = √(−log(s)).
The length of the vector (x, y) is √s and to get the vector of length r we must multiply (x, y) by r/√s = √(−log(s))/√s = √(−log(s)/s).
The final two independent random samples from the normal distribution are
xm and ym where m = √(−log(s)/s).
The pesky factor √2 comes from the fact that we started with e^{−x^{2}}, whose variance is ½ instead of the standard 1.
Use m = √(−2log(s)/s) to get samples with variance 1 as from the standard distribution.

I find both the pedagogy and precision of the above very poor. I would not believe it without the many tests I have tried on the resulting algorithm. I don’t find the Wikipedia article much better.

OCaml version; Algol68 version; C; Scheme