Memento

Suppose that I want to remember a secret, such as a 100 bit number, for several years and that I do not want to write it down inside a computer or out. For instance it might be the seed of my private key. I will suppose here that I can keep a reliable record somewhere but am unable to keep that record secret. Web sites that require a personal password frequently record other obscure information, such as mother’s maiden name, as an alternate identification means. I propose here a similar but more secure scheme based on that insight.

When we are young we learn quite a few things that we probably will not forget. Many of these are known only to ourselves. I think that there is no one alive, nor any record of the name of my first dog or my first cat, yet I remember those names well. Such memories may provide sufficient entropy for a long lived secret.

There are two parts to the scheme: The early part is when I form the record, and the late part is when I use the record and my memories to retrieve the secret.

In the early part I collect such memories arranged as an unordered set of keyword-value pairs. I select a threshold, t, which is a number between 0 and 1 and that is the fraction of the memories that must be combined with the record to recover the secret. I thus select at the beginning a tolerable amount of memory loss. I record the keywords in the record and, of course, omit there any clue as to the values. The Shamir secret sharing algorithm (SSS) can be used during the second part to recover the entire entire secret from the record and a sufficient portion of the memories.

The Shamir algorithm has two phases: The early phase produces several shares from the secret and the late phase combines a subset of the shares to recover the secret. Each share is the same size as the secret.

Here is how we adapt SSS:

The early algorithm takes the keyword-value pairs, enumerates them, and includes an enumerated list of keywords in the record. It next invokes the early portion of SSS providing:

The secret that is to be preserved,
The number of fixed size hunks into which the memory values will be divided, see below
The quorum size which will later be required to recompute the secret.

The routine will return a series of “secret portions”, a subset of which the late phase of SSS can use to recover the secret. Each portion is the same size as the secret.

Our routine now combines these portions, perhaps with exclusive or, with the enumerated secret values broken into fixed size hunks. Each hunk is hashed to a bit string the length of the secret. These combined values are then included in the record.

It is important that each of these hunks carry about the same entropy if the quorum logic is to be effective. For instance if 10 right answers constitutes a quorum and there are 10 questions each providing about 1 bit of information, then the attacker has a trivial job. An estimate must be made of the entropy in each answer, either by the software or by the user. Questions with high entropy answers might be divided into more than one hunk.

For recovery of the secret, the record is used to prompt for most of the memories, producing the same values which can once more be divided into the same hunks, recombined with the values in the record to retrieve the secret portions from the early phase of SSS. These can now be delivered to the late phase of SSS which will then return the original secret.

There is a delicate pitfall to be avoided here that requires some careful analysis that I will not do just now. You probably want at least 90 bits of entropy to guard against offline attacks. Some of the questions, which are available to the attacker, will suggest only a few bits of entropy in the answer, often just one. If there are too many such questions and each forms a hunk, the attacker may form a quorum by guessing answers to such simple answers. One protection against such an attack is to combine simple questions into a single hunk, and do this in such a way as to make each hunk contain nearly the same amount of entropy. This may require the late phase to combinatorially guess at answers that are forgotten. The attack program has thus much in common with the legitimate late phase program.

Here is a record of memories that I think no one else shares, and the estimated entropy:

7, My first cat’s name
6, My first dog’s name
8, My neighbor’s dog’s name
2, The color of my neighbor’s dog
9, What article my father found under the pampas grass
5, The name of my nemesis in school
5, name of teacher who thought I should not study symbolic logic
4, For what did I install a control in the Whittier house
3, What was the make of our car with a rumble seat
4, What was the make of the car with an accurate clock
5, between which two rooms did we install a door in the Reseda house
9, what game did Bill Lindley and I play on the board in the Math room in PCC
9, what Russian phrase did a Berkeley professor ask me to translate?
6, what was his name
2, what tree(s) grew to the west of the Reseda house
2, what tree(s) grew to the south of the Reseda house
4, what plants grew in profusion 100 feet north of house
2, which side of the house had sprinklers built in
4, two kinds of grapes we grew
3, kind of plants to south of Whittier house
3, kind of plants to west of Whittier house
3, kind of plants to 40 feet west of Whittier house
5, What teacher forbad taking notes
4, what teacher caused me to change schools
4, what math teacher first assigned homework in French
5, what Cal Tech professor found a bug in my math proof
4, what was it supposed to prove (shortest noun phrase)
3, what was stolen from my office in Livermore

An improvement on the above is to ask which sides of the house had sprinklers. That might be worth nearly 4 bits. Which subset of about 12 enumerated plants might be worth nearly 12 bits and easier to recall.

I have just heard of a test that might serve the same objective as the above scheme. It has the advantage of not requiring prior planning. I don’t know how reliable it is. The test has a name which I forget.

The question is of authorship, as contrasted to plagiarism. A person who wishes to claim authorship submits to the following test: The text in question is automatically modified to replace every 5th word by a blank. It has been observed that the real author is able to replace most of those words correctly while another person, even if he is familiar with the same subject, cannot.

To provide the same function as above some 1000 or so words of text composed by the principal are chosen. Every 5th word is hashed to about 2 bits which yields about 400 bits. These bits are submitted to a error correcting scheme that can recover after 100 of the 400 bits have been corrupted. The bits are used to produce the private key.

See this for a crude description of an attack on this scheme.