Code Proofs

Here are some pointers to ideas about proving properties of programs.

Research on Proof-Carrying Code for Mobile-Code Security, Peter Lee and George Necula (CMU).
Infrastructure for Proof-Referencing Code, Carl Gunter (Penn), Peter Homeier (Aerospace Corp), and Scott Nettles (Penn).
Trust Management and Proof-Carrying Code in Secure Mobile-Code Applications, Joan Feigenbaum (AT&T Labs) and Peter Lee (CMU).
Turpin’s PhD thesis, “Programming Data Structures in Logic”
recent interesting comments
Verve at MS

A few years ago I began to develop some ideas on a proof system for code. These were compromises between extremes of earlier ideas and are meant to be contributions to the engineering of proof systems rather than any fundamental ideas. My ideas have never been nearly complete.

I will be concerned here mainly with proofs that programs are well behaved, rather than correct. By well behaved I mean that they respect access boundaries without the need for runtime checking, and that they finish certain tasks in a known time.

Such theorems are typically necessary for a particular program before correctness of any sort can be proven.

By “program” I mean the binary form that the processor obeys—after the compilers, linkers and loaders are done. I anticipate that the checker could run without significant cost after loading of critical code. Obviously some would trust more of the infrastructure.

There are two ideas here:

I think that these can well be separated but I will let others do that.

The basic idea is to present a plan for a program. The plan can be roughly described as an encoding of the data declarations of the program. Some languages, perhaps most or all, will require information beyond the declarations, however, to build a suitable program plan.

Our design goal is to devise a definition of what it is to be a plan for a program so that:

A secondary goal is to keep the plan small. To that end we require a degree of automatic proof generation for well-defined and truly trivial cases. The proof need not reiterate that the semantics of a load instruction are to copy data from memory to a register; the proof checker will know the semantics of the common instructions and need not be reminded. All of this might be achieved by reference to previously proven theorems, but this assumes that the technology supports the concept of theorem. I do not yet assume that.

Form of a Plan

A plan is a binary pattern suitable for easy storage in RAM. A proof checker may bring the program and its plan into RAM but that is only the checker’s business. The shape of a plan should make it suitable for there to be a plan for plans and I would like to think that the proof checker could comprehend itself, but that is not really critical—merely elegant. That goal has influenced my design.

We speak of pointers within the plan. That would normally indicate data with addresses within some containing address space, but here a pointer is relative to the beginning of the plan. Alternatively, the compiler would produce the plan, which would be loaded in its own section along with the code, and internal pointers within the plan would be adjusted as a normal loader function. Note that the loader need not be trusted: the check will fail if the plan is misloaded, and if a misloaded plan passes anyway, then the code is safe for reasons not comprehended by the prover.

Content of the Plan

The plan will come largely from the data declarations in a fairly direct manner, just as the code comes fairly directly from the executable statements in some higher level language. I presume that compilers would be modified to produce the plan much as they now produce debugging information. The compiler may need some extra information from the user, probably via the source. This extra information is likely to aid the human reader of the source as well, and indeed knowledge by the source reader that the program had compiled and pleased the proof checker would ease some burdens of humans who must validate the code.

Malloc

Today most programs explicitly allocate storage by calling some routine such as Unix’s malloc. This requires a concept of a part of (free) storage to which there are no live pointers.

Closely connected to this are the several general plans to release storage:

explicit deallocation, as with Unix's free
garbage collection
reference counting

Each of these presents special problems. It would be nice to prove that the general allocation infrastructure code (malloc, gc, etc.) is correct but that is not necessary for a useful technology.

As Induction Hypothesis

Viewed as a proof, the plan presents a coded form of an induction hypothesis about how memory is laid out. It would say that memory holds disjoint blocks of various shapes; each matched to some particular element of the plan to which it conforms. The induction would be over the number of units of code that had been executed. Hopefully such units would be no bigger than basic blocks. I don’t know that it will be necessary for the plan to identify the units.

For instance if the source were the language C and there were a struct declared in that source, then there would be a plan element (PE) that described each field of that struct and the PE for the struct would be an array of pointers to other PEs that describe the successive fields of the struct. The stack frame’s shape would likewise conform to some plan element, one such element for each routine that allocates stack frames.

While the PE for a struct is an array, the PE for an array is a struct, or perhaps more likely, a tagged union of PEs depending on the discipline used to bound the indexes used to access the array.

The above formats of metadata appeared in the earliest PL/I systems to support debugging and safe access to arrays whose size was unknown at compile time.


Compiled polymorphic languages embody an object as a block X of data that begins with a constant pointer to a constant v-table which is shared between objects of the same class. Following the pointer in block X are the instance variables. The v-table holds the code address for each class method. It also gives the length of the X blocks of its class. Offsets in this table are fixed in early compilation. As a general call site is compiled, the compiler knows the method and thus the v-table offset. The compiled code must sequentially get the address of block X, then load the address of the v-table, and then load the address of the method entry point.

For polymorphism we introduce a PE, called a QPE (quantified PE), consisting of a prefix and another PE, p2. The prefix introduces a new type identifier, say x, which refers to no particular type. The identifier can appear in p2 in any context requiring a plan pointer. This is formally like a universal or existential quantifier in predicate calculus. A block of storage conforms to the QPE plan if for some plan z (not necessarily stored anywhere just now) that block would conform to p2 with a pointer to z substituted for each occurrence of x in p2. Unsettled issue: may x occur in plans referenced by p2? I think so, but then the question of the scope of x arises. In predicate calculus scopes are contiguous and nested. We probably need a kindred discipline.


To support polymorphism we introduce a PE which means that for some type x, the following PE describes possible memory blocks for the running program. The following PE may refer to this unknown type. This may be used to form PEs with the following meaning:
For each block of storage described by this PE there is some type x. The block has two words: the first word locates code in the form of a subroutine that takes a pointer to a block whose form is x. The second word is a pointer to data in form x.
A PE in the above form can be used to describe a procedure value (a closure) such as provided by Algol, PL/I or Pascal. The first word locates the code for the procedure and the second locates the specific activation record that holds the values for the free variables in the source for the procedure.