A Theory of Resets

Many years ago I read some papers on the architecture of the phone company’s ESS systems (before SS7 I think). I was impressed that they had a theory of resets that helped organize some ideas on how to build systems with high availability even amid functional upgrades. I cannot now find those documents but I will recount here some of the ideas that led us to adopt persistence for the Keykos design.

The ESS had several reset levels, perhaps four or five. They were in a partial ordering, even a simple ordering in the cases they explained. I do not recall them in detail but here are a few:

Reset calls in the process of being set up. Anyone dialing during this reset will get a new dial tone.
Reset calls. Anyone talking will get a signal indicating that the connection is gone, as if the other party had hung up.
Reset hardware maps. The system would require reloading the data that maps between phone numbers and the physical equipment that served those numbers. Other lesser data was also reloaded. Human intervention was required.
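
The ordering of these levels can be sketched in a few lines of Python. This is a toy model and not a description of the ESS. The class name, the level names and their numbers are invented here, and the sketch assumes the simple (total) ordering, in which a more severe reset also performs every milder one.

```python
from enum import IntEnum


class ResetLevel(IntEnum):
    # Hypothetical numbering: low numbers are mild, as in the text.
    CALLS_BEING_SET_UP = 1
    ESTABLISHED_CALLS = 2
    HARDWARE_MAPS = 3


class Switch:
    """Toy switch holding three kinds of state, one per reset level."""

    def __init__(self):
        self.dialing = {"555-0101"}                   # calls being set up
        self.connected = {("555-0102", "555-0103")}   # calls in progress
        self.hardware_map = {"555-0101": "line-17"}   # number -> equipment

    def reset(self, level):
        # Assumption: a reset at a given level includes all milder levels.
        if level >= ResetLevel.CALLS_BEING_SET_UP:
            self.dialing.clear()
        if level >= ResetLevel.ESTABLISHED_CALLS:
            self.connected.clear()
        if level >= ResetLevel.HARDWARE_MAPS:
            self.hardware_map.clear()   # must be reloaded by an operator
```

A reset at `ResetLevel.CALLS_BEING_SET_UP` empties only `dialing`; conversations and the hardware map survive. Only the most severe level discards the map.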

As I recall they used terms such as “level three reset”, but Google is ignorant of that phrase. Low-numbered resets were mild. Such terms came from hardware engineering. Resets could be triggered by various means, but once a system was installed the most severe resets were not expected during the lifetime of a typical installation.

Application to Keykos

I took from their theory of resets the idea that application upgrades should reset what was necessary for the task at hand, and no more. Most systems today have a meagre set of resets:

Reboot the system. In an object system this means that all objects must be reincarnated by some process unlike normal object creation.
“Force Quit” in Mac OS X, or “kill” in Unix. Some portion of application state is reclaimed without the help of application logic.

These are crude tools—special forms of reset. Keykos meters can stop portions of the system but retain the option to continue. What portions can be thus controlled is itself part of the application design, just as in the design of digital hardware.
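
A rough sketch of this kind of control, in Python, follows. It is not the Keykos meter interface. The class and method names are hypothetical, and the model assumes only that meters form a tree and that work under a meter may proceed only while that meter and all of its superiors are running.

```python
class Meter:
    """Toy meter: work charged to it runs only if every ancestor runs."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.running = True

    def stop(self):
        # Stops this meter and, implicitly, everything beneath it.
        # No state is discarded, so the work can be continued later.
        self.running = False

    def resume(self):
        self.running = True

    def may_run(self):
        # Walk toward the root; any stopped meter on the path blocks us.
        meter = self
        while meter is not None:
            if not meter.running:
                return False
            meter = meter.parent
        return True


root = Meter("root")
app = Meter("app", root)
worker = Meter("worker", app)
other = Meter("other", root)

app.stop()      # worker.may_run() is now False, other.may_run() stays True
app.resume()    # worker.may_run() is True again, with its state intact
```

Where the application designer places meters in the tree determines which portions can be stopped independently.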

The space bank provides means to reclaim space when the application that allocated it goes berserk.
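
The reclamation idea can be modelled in Python as follows. Again this is a toy and not the Keykos space bank interface: the names are invented, pages are represented by plain integers, and the sketch assumes that banks nest and that destroying a bank reclaims everything allocated from it and from its sub-banks.

```python
import itertools

_page_ids = itertools.count()   # stand-in for real page allocation


class SpaceBank:
    """Toy bank that remembers which pages were allocated through it."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.pages = set()
        if parent is not None:
            parent.children.append(self)

    def allocate(self):
        page = next(_page_ids)
        self.pages.add(page)
        return page

    def in_use(self):
        # Pages held by this bank and by all banks beneath it.
        return len(self.pages) + sum(c.in_use() for c in self.children)

    def destroy(self):
        # Reclaim this bank's pages and those of every sub-bank,
        # without any cooperation from the code that allocated them.
        reclaimed = len(self.pages)
        for child in list(self.children):
            reclaimed += child.destroy()
        self.pages.clear()
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None
        return reclaimed
```

If a berserk application was handed its own bank, one call to `destroy()` on that bank recovers its space and leaves the rest of the system's storage untouched.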

In practice we found our long-lived abstracted states quite robust and soon forgot our fear of state rot in old objects. We had the good fortune to be running on hardware with a mean time to failure of perhaps a year (it was a mainframe) and well-duplexed disk storage. We almost never resorted to a tape checkpoint except to assure ourselves that it was possible. IBM’s VSAM was indeed an abstracted file system that maintained order with early versions of balanced tree mechanisms. I never even heard misgivings that the data was not out in plain view.