Bias in Passwords

<title>Bias in Passwords</title>
It is common place to estimate the strength of a password by multiplying the number of characters by the logarithm of the number of characters in the ‘alphabet’.
This is clearly a crude measure.
“Mississippi” is not a terribly good password and even “thesopsis” is bad because it is euphonic.
<p>I examine a similar mathematically tractable problem here.
You are trying to find S, a 64 bit pattern whose secure hash you know.
You know that S was chosen using an independent but biased bit source which produces zeros with twice the probability as ones.
<p><a href=a.c>This program</a> takes command line parameters n and f.
For an n bit word chosen so that the probability of each bit being one is f, and for each k, 0≤k≤n, shows:<ul>
<li>k
<li>The probability that the word has k ones,
<li>The probability that the word has less than k ones,
<li>The probability that an unbiased n bit word has less than k ones.
<li>Word size for equivalent unbiased search.
</ul>
<pre>gcc -Wall <a href=a.c>a.c</a> -lm
./a.out 64 0.333333333333333</pre>
yields
<pre>  0  0.00000  0.00000  0.00000 -inf
  1  0.00000  0.00000  0.00000  0.0
  2  0.00000  0.00000  0.00000  6.0
  3  0.00000  0.00000  0.00000 11.0
  4  0.00000  0.00000  0.00000 15.4
  5  0.00000  0.00000  0.00000 19.4
  6  0.00001  0.00000  0.00000 23.0
  7  0.00003  0.00001  0.00000 26.3
  8  0.00009  0.00003  0.00000 29.4
  9  0.00029  0.00013  0.00000 32.3
 10  0.00079  0.00042  0.00000 34.9
 11  0.00195  0.00121  0.00000 37.4
 12  0.00431  0.00316  0.00000 39.8
 13  0.00862  0.00747  0.00000 41.9
 14  0.01569  0.01608  0.00000 44.0
 15  0.02615  0.03178  0.00000 45.9
 16  0.04005  0.05793  0.00001 47.7
 17  0.05654  0.09798  0.00004 49.3
 18  0.07381  0.15451  0.00011 50.9
 19  0.08935  0.22832  0.00031 52.3
 20  0.10052  0.31767  0.00078 53.7
 21  0.10531  0.41819  0.00184 54.9
 22  0.10291  0.52350  0.00407 56.1
 23  0.09396  0.62641  0.00843 57.1
 24  0.08026  0.72038  0.01638 58.1
 25  0.06421  0.80064  0.02997 58.9
 26  0.04816  0.86485  0.05171 59.7
 27  0.03389  0.91301  0.08432 60.4
 28  0.02239  0.94689  0.13022 61.1
 29  0.01390  0.96928  0.19087 61.6
 30  0.00811  0.98318  0.26615 62.1
 31  0.00445  0.99129  0.35399 62.5
 32  0.00229  0.99573  0.45033 62.8
 33  0.00111  0.99803  0.54967 63.1
 34  0.00051  0.99914  0.64601 63.4
 35  0.00022  0.99964  0.73385 63.6
 36  0.00009  0.99986  0.80913 63.7
 37  0.00003  0.99995  0.86978 63.8
 38  0.00001  0.99998  0.91568 63.9
 39  0.00000  0.99999  0.94829 63.9
 40  0.00000  1.00000  0.97003 64.0
 41  0.00000  1.00000  0.98362 64.0
 42  0.00000  1.00000  0.99157 64.0
 43  0.00000  1.00000  0.99593 64.0
 44  0.00000  1.00000  0.99816 64.0
 45  0.00000  1.00000  0.99922 64.0
 46  0.00000  1.00000  0.99969 64.0
 47  0.00000  1.00000  0.99989 64.0
 48  0.00000  1.00000  0.99996 64.0
 49  0.00000  1.00000  0.99999 64.0
 50  0.00000  1.00000  1.00000 64.0
 51  0.00000  1.00000  1.00000 64.0
 52  0.00000  1.00000  1.00000 64.0
 53  0.00000  1.00000  1.00000 64.0
 54  0.00000  1.00000  1.00000 64.0
 55  0.00000  1.00000  1.00000 64.0
 56  0.00000  1.00000  1.00000 64.0
 57  0.00000  1.00000  1.00000 64.0
 58  0.00000  1.00000  1.00000 64.0
 59  0.00000  1.00000  1.00000 64.0
 60  0.00000  1.00000  1.00000 64.0
 61  0.00000  1.00000  1.00000 64.0
 62  0.00000  1.00000  1.00000 64.0
 63  0.00000  1.00000  1.00000 64.0
 64  0.00000  1.00000  1.00000 64.0
ultimately: 1.0000000 1.0000000</pre>
From this we can see that we will probably find the answer before we have tested all words with 22 or fewer ones which is only about 0.4% of the 2<sup>64</sup> cases.
We will have the answer with 99% probability after we have searched 26% of possibilities.
If computing costs us anything it is wise to order the search with few bits first, but how do we order the search this way?
The effects are more dramatic for a more biased bit generator.
<p>It may be that computing the secure hash dominates generating the biased words in which case we need not generate the biased words efficiently, but then it is important to do fairly accurately.
<p><a href=b.c>The routing <tt>gen(rep, n)</tt></a> generates all numbers from 0 to 2<sup>n</sup> and calls routine rep for each.
It does it in order of number of ones in the generated number.
The inner recursive routine <tt>rg(p, m)</tt> adds m ones to the right p bits of <tt>t</tt> in all possible ways.
Those p bits are zero before and after the execution of <tt>rg</tt>.
<p>Alas this cute code sheds little light on the original problem of searching for textual passwords while exploiting bias such as euphony.
The two obvious means are:<ul><li>to search all words in some dictionary<li>and then all combinations of letters from whatever alphabet is allowed in the given password system.</ul>