Here are a couple of ideas that are probably not new, but I have not seen them described. They may be applied at the slow end of a storage hierarchy, where data ages out to bigger, slower memory and becomes part of a system state that you preserve for a while in case recovery is needed. Keykos typically kept two copies of all data on the disks where checkpoints and permanent data lived. Here we consider more nuanced schemes.

There are failure modes of hardware, or even of whole geographic sites, to which we want to be invulnerable. The failure that concerns us here is a storage device that fails to retrieve data entrusted to it. We assume that we are not misled with the wrong data; that is a different problem, non-trivial but solved. When we decide that we have taken sufficient precautions to declare a checkpoint taken, we presumably want not to be vulnerable to the failure of any single hardware component. Duplicating disk storage on distinct controllers and devices should achieve this. Erasure coding provides both better economy and less hardware vulnerability, at the cost of extra (simple) processing and transmission, which might interfere with important time-sensitive work. The raw idea is to merely duplicate as we close in on a checkpoint, and at leisure consolidate this storage into an erasure-coded form that requires less space and is even less vulnerable to hardware failure.
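To make the trade-off concrete, here is a minimal sketch (not Keykos code; all names are hypothetical) of the simplest erasure code: k data blocks plus one XOR parity block. Any single lost block is recoverable from the survivors, yet the storage overhead drops from 2x (plain duplication) to (k+1)/k. Real consolidation would use stronger codes tolerating multiple losses, but the arithmetic is the same in spirit.

```python
def xor_blocks(blocks):
    """XOR together equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def encode(data_blocks):
    """Consolidate k data blocks into k+1 blocks: data plus one parity."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def recover(stripe, lost_index):
    """Rebuild the block at lost_index from the k surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

# Three 4-byte data blocks become four blocks on four devices:
# 4/3 the space of the originals, versus 2x for duplication.
stripe = encode([b"AAAA", b"BBBB", b"CCCC"])
assert recover(stripe, 1) == b"BBBB"    # data device 1 failed
assert recover(stripe, 3) == stripe[3]  # parity device failed
```

Placing each of the k+1 blocks on a distinct controller and device gives the single-component invulnerability that duplication gives, with less space.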

Separately, we can send data to remote sites. Such work falls into the category of "work delayed is often work avoided", but delaying it means that a site failure forces recovery from a less recent checkpoint. In the extreme case a quorum of sites can together reproduce a checkpoint, and one of them can resume. This requires much data transmission.

We exclude malicious storage devices that contrive to fool us; we assume that when a storage device's medium is bad, the device returns nothing or an error indication.