Someone selects an integer, K, from the uniform distribution between 0 and 100 inclusive. He prepares an urn with K black balls and 100−K white balls. We do not learn the value of K. We select a ball from the urn and find that it is black, which we expected with probability 0.5. We replace the ball and stir to simplify later considerations. What is the probability that the next ball will be black?
Let's imagine that the urn preparer prepared 101 urns instead, one for each value of K, and selected one at random for our experiment. This does not change the math and allows a convenient style of argument.
There are now 10100 balls in the collective urns, and the first ball we picked is equally likely to have been any one of them. There are as many white balls as black, which means that the probability that the first ball was black is ½, as we presumably expected.
We postulated that K was selected from a uniform distribution, and this serves as the a priori distribution. We thus say that the probability that our urn has k black balls is Pk = 1/101. We now have a data point, one black ball, and it is time to compute the a posteriori probability Pbk of our having urn k.
The black ball we picked is equally likely to have been any of the 5050 black balls in the 101 urns, and thus our probability of having picked some particular one is 1/5050. There are k black balls in urn k, and thus the probability, Pbk, that we have urn k is k/5050. We have thus used the presumption that each urn was a priori equally likely.
The expected number of black balls in our urn is now the sum of the number in urn k times the probability of our having urn k.
E(# black balls in our urn) = ∑ k Pbk = ∑ k²/5050 = 338350/5050 = 67.
In each case the summation is for k from 0 to 100.
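The discrete computation above can be checked exactly with exact rational arithmetic. This is a minimal sketch; the names `posterior` and `expected_black` are mine, not from the text.

```python
from fractions import Fraction

# Exact posterior over the 101 urns after one black ball: P(urn k | black) = k/5050.
posterior = [Fraction(k, 5050) for k in range(101)]
assert sum(posterior) == 1

# Expected number of black balls in our urn: sum over k of k * P(urn k | black).
expected_black = sum(k * p for k, p in enumerate(posterior))
print(expected_black)   # 67
```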
Before any sampling our distribution over the urns was Pk = 1/101. Upon finding one black ball it became Pbk = k/5050. Had we instead seen one white we would have had Pwk = (100−k)/5050.
With conventional notation P(x|y) indicates the probability of x being true given that y is. Thus P(first sampled ball is black) = 0.5 but P(second sampled ball is black | first sampled ball is black) = 0.67.
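Both probabilities can also be checked by simulating the experiment directly, sampling K uniformly and drawing twice with replacement. A quick Monte Carlo sketch (the variable names are illustrative):

```python
import random

random.seed(1)
trials = 200_000
first_black = 0
both_black = 0
for _ in range(trials):
    k = random.randint(0, 100)           # the preparer's uniform choice of K
    if random.random() < k / 100:        # first draw: black with probability k/100
        first_black += 1
        if random.random() < k / 100:    # second draw, after replacement and stirring
            both_black += 1

print(first_black / trials)       # close to 0.5
print(both_black / first_black)   # close to 0.67
```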
By f(w,b,x) we denote the distribution of urn densities after having sampled w white balls and b black balls; x is the density of black balls. To recapitulate the discrete case above: before the first sample we have an a priori distribution of f(0,0,x) = 1, which ascribes an equal probability to all urn densities. After one black ball the new distribution is f(0,1,x) = 2x. Note that we normalize so that the integral from 0 to 1 of the distribution function is 1. Likewise f(1,0,x) = 2 − 2x.
f(j, k+1, x) = x f(j, k, x), except for renormalization.
Dwight’s Tables of Integrals gives the definite integral from 0 to 1 of x^k (1−x)^j as (j! k!)/(j+k+1)!.
With renormalization we have in general
f(j, k, x) = ((j+k+1)!/(j! k!)) x^k (1−x)^j. The same result from Dwight’s lets us compute the first moment to get P(next sample will be black) = (integral of x f(j,k,x))/(integral of f(j,k,x))
= ((j! (k+1)!)/(j+k+2)!)/((j! k!)/(j+k+1)!) = (k+1)/(j+k+2).
This gives a probability for black of 1/2 on the first draw and 2/3 after drawing a black ball.
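The factorial ratio above can be evaluated exactly and compared against the simplified form (k+1)/(j+k+2). A sketch, with a function name of my own choosing:

```python
from fractions import Fraction
from math import factorial

def p_next_black(j, k):
    # Ratio of the two integrals in the text, via Dwight's factorial formula.
    first_moment = Fraction(factorial(j) * factorial(k + 1), factorial(j + k + 2))
    normalizer = Fraction(factorial(j) * factorial(k), factorial(j + k + 1))
    return first_moment / normalizer

print(p_next_black(0, 0))   # 1/2 on the first draw
print(p_next_black(0, 1))   # 2/3 after one black ball
```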
The value of x producing the maximum of f(w,b,x) is b/(w+b) for w, b > 0. The Bayes and maximum likelihood estimates diverge a bit for small values of b and w. For instance, after one white and two blacks Bayes would judge the density to be 60% while the maximum likelihood estimate is 67%.
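The two estimates from the example can be spelled out side by side; the function names here are hypothetical, but the formulas are the posterior mean (b+1)/(w+b+2) and the likelihood peak b/(w+b) from the text.

```python
from fractions import Fraction

def bayes_estimate(w, b):
    # Posterior-mean density of black balls: (b + 1)/(w + b + 2).
    return Fraction(b + 1, w + b + 2)

def ml_estimate(w, b):
    # Maximum-likelihood density: b/(w + b), the peak of f(w, b, x).
    return Fraction(b, w + b)

print(bayes_estimate(1, 2))   # 3/5, i.e. 60%
print(ml_estimate(1, 2))      # 2/3, i.e. about 67%
```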
See another continuous exercise.
Bayes introduced these ideas about 1740 and Laplace elaborated on such formulæ in 1812 in his Théorie Analytique des Probabilités. They are controversial in part because they require a priori probabilities that seem entirely arbitrary to some.