I have been charmed several times by Kolmogorov's notion of sending decompression algorithms along with compressed data. Alas, these schemes founder on the issue of safely obeying the algorithm. I seek here a simple way to safely obey a class of untrusted programs on a computing platform so as to protect that platform’s integrity. (Snide remark: a reasonable platform would do this trivially for native machine language code, but we have no such platforms.) The following is an alternate solution that has interpretive costs but is portable across systems. It might also provide a way in the 22nd century to run some 21st century programs when there is no longer a committee of 21st century engineers who collectively recall the arcane details of the ancient platforms.

I would like to see a computer language with the following virtues:

  1. Binary like the instruction set of a computer,
  2. Easily interpretable by a small C program, like a byte code language,
  3. Within a factor of about 2, space-wise, of current commercial machine languages,
  4. Within a factor of about 40, speed-wise, of current commercial machine languages,
  5. A language into which most programming languages can be compiled,
  6. A language that can be translated with moderate efficiency to the machine language of popular CPUs,
  7. Clear semantics in a few pages of English,
  8. Semantics that prominently lack the ability to damage the platform,
  9. Some extremely circumscribed way to deliver computed results.

Other standard options may be available, as explicitly allowed by the operator at run time.

There is a fundamental issue of garbage collection, and I see a perhaps unavoidable dichotomy. Programs can serve many purposes without garbage collection. Languages that are memory safe and garbage collected are possible, and perhaps have even been constructed, but I aim here at something much simpler, where the protection comes at the boundary of the sandbox. One may build garbage collection within, but platform integrity does not rely on its correctness. Applications where GC really must run often are difficult in this plan. Bringing your own GC code is a heavy solution; a few standard GCs are possible. Designing a language like Java byte codes solves the problem but badly blows my complexity budget, and Java’s GC, while pretty good, does not suit the needs of all languages. Language safety can solve a significant class of integrity protection problems, but this project is predicated on limiting damage by confining programmed mutation to the sandbox.

There is an interesting dichotomy between languages, be they machine languages or higher level languages: is there a difference between an integer and an address? C provides a desultory distinction in its type system, but invites conversion (and corruption) at the drop of a cast. C conspicuously declines to check array bounds, and there is no way for a routine to learn the bounds of an array parameter. Contrariwise, capability hardware architectures provide ‘addresses’ in the form of capabilities, which the hardware carefully distinguishes from integers; there, integers are limited to array indexes, which are assiduously checked.

This site pushes the capability distinction yet here I tentatively propose C’s promiscuity because it allows a class of patterns that classic languages support only with special constructs, if at all. For simplicity we put our trust solely in the sandbox mechanism.

The first plan is to provide a simple trusted interpreter which conflates integers and addresses in the logic of the executing program, all relative to a fixed array in the interpreter’s memory, called the sandbox here. This much would suffice for many purposes, including tasks that are not feasible today because we must protect the environment where the program runs. Simple semantics and a simple interpreter seem sufficient.

Implicit in the above is the assumption that the program is responsible for the bit-by-bit layout of its sandbox: the semantics goes down to the bytes in the sandbox and their addresses therein.
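To make the point concrete, here is a minimal C sketch of the only memory primitive the trusted core needs; the names (sandbox, SANDBOX_SIZE, load4, store4) are illustrative, not part of any specification, and byte order is deferred to the next paragraph.

  #include <stdint.h>
  #include <string.h>

  #define SANDBOX_SIZE (1u << 24)        /* hypothetical 2^24 byte sandbox */

  static uint8_t sandbox[SANDBOX_SIZE];  /* the guest's entire addressable world */

  /* Every guest memory reference funnels through a check like this.
     Addresses are plain integers; safety comes only from the bounds test. */
  static int load4(uint64_t guest_addr, uint32_t *out)
  {
      if (guest_addr > SANDBOX_SIZE - 4)
          return -1;                     /* fault: reference escapes the sandbox */
      memcpy(out, &sandbox[guest_addr], 4);
      return 0;
  }

  static int store4(uint64_t guest_addr, uint32_t value)
  {
      if (guest_addr > SANDBOX_SIZE - 4)
          return -1;
      memcpy(&sandbox[guest_addr], &value, 4);
      return 0;
  }

Nothing the guest computes can name memory outside the array; that is the whole of the protection.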

I would include parts of the IEEE floating point spec by reference, but only those parts that prescribe how to combine one or two numbers to produce another.

The language must prescribe an endianness. I would prescribe big-endian, but there is a problem with either choice when the host is of the other endianness. There is a trick to ameliorate the problem. Suppose that the language is big-endian and the host is little-endian. The sandbox that serves as the guest’s memory is stored backwards: guest address 0 is kept in the byte whose host address is the largest, and guest addresses are negated before they are used to access the sandbox. We illustrate with a 2^24 byte sandbox here. If the guest program fetches the 4 byte integer at guest address 0x08, the interpreter negates this number (modulo 2^24) to get 0xfffff8. The host interpreter loads the 4 bytes at offset 0xfffff8 - 4 = 0xfffff4 in the guest memory, which retrieves the 4 bytes in their correct order. To fetch the 8 byte operand at guest address 8 the host fetches offset 0xfffff8 - 8 = 0xfffff0, again ready for immediate host data manipulation commands such as add. No byte swapping commands are needed.
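A minimal sketch of that trick for 4 byte loads, assuming a backwards-stored array box as the sandbox; the host offset computed below is the same arithmetic as the negation above (0x08 gives 0xfffff8, minus the operand size, gives 0xfffff4).

  #include <stdint.h>
  #include <string.h>

  #define GUEST_SIZE (1u << 24)          /* the 2^24 byte sandbox of the example */

  static uint8_t box[GUEST_SIZE];        /* guest memory, stored backwards */

  /* Big-endian guest on a little-endian host.  Guest address 0 lives at the
     highest host offset, so guest byte g sits at host offset GUEST_SIZE-1-g,
     and the 4 bytes of a big-endian integer arrive already in the host's own
     order; no byte swapping is needed. */
  static int guest_load4(uint32_t gaddr, uint32_t *out)
  {
      if (gaddr > GUEST_SIZE - 4)
          return -1;                     /* fault: reference escapes the sandbox */
      memcpy(out, &box[GUEST_SIZE - gaddr - 4], 4);
      return 0;
  }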

Some program loading is required at the host end so that common libraries need not be shipped with each small guest program. The loading program is not in the security kernel, except in that it must do nothing other than write in the sandbox. The host has the option of precompiling common libraries to native machine language, but this expands the trusted code and increases the attack surface of the system.
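As a sketch of how narrow the loader’s interface can be kept (reusing the hypothetical sandbox and SANDBOX_SIZE names from above), a single entry point of this shape is easy to audit:

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* The loader is outside the security kernel except for one constraint: the
     only thing it may do to the host is write bytes into the sandbox. */
  static int loader_write(uint64_t guest_addr, const uint8_t *src, size_t len)
  {
      if (len > SANDBOX_SIZE || guest_addr > SANDBOX_SIZE - len)
          return -1;                     /* refuse writes that would escape the sandbox */
      memcpy(&sandbox[guest_addr], src, len);
      return 0;
  }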

This plan is reminiscent of the 1957 notion of a program. The instruction set for the IBM 704 would largely satisfy these requirements but a few substantial improvements can be made without a net complexity increase.

I would require the semantics to be deterministic and depend solely on input, program and platform identity. Floating point variations can be accommodated by allowing platform identity dependency.

Tentative

An operand stack is simple, and perhaps a 4 byte alignment requirement would make it fast enough for most native interpreters. On the other hand I like the patterns available to machines with 16 or 32 registers.

Here is a crude scheme that allows the interpreter to dynamically translate to native machine language, and also to provide some set of precompiled services, while efficiently abstracting the translation artifact. Posit a program-settable register, the fence, above which the program cannot store. If a program wants the advantage of translation, it puts the code to be translated at a high address and sets the fence to write protect that code from itself. A special branch is a clue to the interpreter that the target code is suitable for translation. If the interpreter is willing and able, it checks that the target code is write protected and then produces native code, outside the sandbox, whose execution has the same meaning as the target code. If the program should raise the fence to unprotect the code, then the translation is discarded. This fence can also be used for certain garbage collection tricks. An untrusted loader can supply pre-translated standard services; this increases the attack surface of the host platform, but the mechanism is the platform’s option. The fence register serves only to maintain reproducibility; otherwise we would have the indeterminacy of systems with ‘purge instruction cache’ commands. I think this is not a security issue.
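A rough C sketch of the fence, assuming a hypothetical interpreter state struct; the names and the single-byte store are illustrative only.

  #include <stdint.h>

  struct vm {
      uint8_t  *sandbox;
      uint64_t  size;              /* sandbox size in bytes */
      uint64_t  fence;             /* program settable; no stores above it */
      int       translated;        /* native code exists for the protected region */
  };

  /* Every guest store checks the fence as well as the sandbox bounds. */
  static int vm_store1(struct vm *vm, uint64_t addr, uint8_t byte)
  {
      if (addr >= vm->size || addr > vm->fence)
          return -1;               /* fault: outside the sandbox or above the fence */
      vm->sandbox[addr] = byte;
      return 0;
  }

  /* Raising the fence exposes previously protected code to mutation, so any
     translation derived from that region is discarded; execution stays
     reproducible whether or not the host chose to translate. */
  static void vm_set_fence(struct vm *vm, uint64_t new_fence)
  {
      if (new_fence > vm->fence && vm->translated)
          vm->translated = 0;      /* drop the native translation */
      vm->fence = new_fence;
  }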

Comparisons

There are many interpretive languages now, but none that I know of provides virtue 8. This is clearly slower than Google’s NaCl, but it works on other platforms. NaCl has some of the same problems and we should learn from NaCl.

Java byte codes may lack virtue 2, but definitely lack virtues 7, 8 and 9. LLVM lacks 7, 8 and 9.

I know of no interpreters that start from the premise of no external effects and then organize the specification by enumerating the external effects that the operator can elect. For instance, if the program’s purpose is to leave a new file behind, the name of that file should be specified by the operator!

Specifics

I really should consider the binary LLVM semantics to ease translation from LLVM. Consulting this document makes me think that the ‘LLVM assembler language’ is at a higher level than I want. I understand that the complexity there may allow a better match between program intent and machine instructions, but I want more simplicity.

ARM?

ARM, ARM 2000, ARM ARM, ARM gcc, (Google “gcc ARM Mac”)
No Go: Mac ARM gcc

ARM has condition codes. C does not provide access to native condition codes and host condition codes often do not match guest condition codes. Condition codes are probably good engineering but bad for simplicity. Some architects deprecate condition codes.

Current Candidates

I have looked at ARM and AVR instruction architectures. They both have somewhat complex instruction semantics documents. In both cases much of the complexity stems from flexibility (a family of architectures) and a simple definition might result from removing some flexibility. I have not done the work to verify this.
Java: too big; claims to provide native write access to files.
ARM: somewhat complex semantics.

Questions

I think that Google’s Native Client scheme gets leverage from the fact that vetted code can call local native trusted code with no overhead. This combination may result in much better performance because the trusted code need not conform to the awkward NaCl rules, and also because the trusted code need not be transmitted over the wire. I don’t know yet if there is a simple way to get that advantage in my scheme. A square root routine would be naturally provided by a client supplied routine, but other library routines become problematic. Part of the advantage of this project is to protect one from exploitable bugs in the native libraries. A compromise would be to optionally compile library routines into the new language to be dynamically linked with the visiting code. This linker might itself be implemented in the new language.
Enough procrastination! I lay out here a circa 1960 instruction set, simpler even than those of 1960. I defer issues of instruction size and the sizes of fields within an instruction.

Architected state: bytes in sandbox, 2^r 64 bit general registers, 64 bit PC.

Instructions are accumulated here as we design. In the following, d is some integer near 12 and r is some integer near 5; these values are probably to be fixed before the design is finished.

In the following, ‘address’ refers to a binary number produced by program logic, identifying an offset in the sandbox where an operand may be found or deposited. An address field within an instruction comprises an offset of d bits and a register designation of r bits; with respect to a particular system state, such a field yields an address which is the sum, modulo 2^64, of the register and the offset.
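A sketch in C of that state and of the address computation, with illustrative values standing in for d and r:

  #include <stdint.h>

  #define D_BITS 12                      /* offset field, d bits */
  #define R_BITS 5                       /* register designation, r bits */
  #define NREGS  (1u << R_BITS)          /* 2^r general registers */

  /* Architected state: bytes in the sandbox, 2^r 64 bit registers, a 64 bit PC. */
  struct state {
      uint8_t  *sandbox;
      uint64_t  reg[NREGS];
      uint64_t  pc;
  };

  /* An address field holds an offset of d bits and a register number of r bits;
     it yields (register + offset) mod 2^64, which uint64_t arithmetic provides. */
  static uint64_t effective_address(const struct state *s, uint32_t field)
  {
      uint32_t offset = field & ((1u << D_BITS) - 1);
      uint32_t regno  = (field >> D_BITS) & (NREGS - 1);
      return s->reg[regno] + offset;     /* wraps modulo 2^64 */
  }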

Notes

These ideas may complement this endeavor.