Being Corrupted by What You Read

When a program suffers a buffer overrun, it is a bug in the program. Such bugs are usually due to a C or C++ language design feature that favors efficiency over safety. A C program can check for this, but doing so usually requires extra source code and execution time. The specification of strcpy illustrates the pitfall: it copies bytes until it meets a terminating zero, without regard to the size of the destination. Buffer overrun is merely the most famous form of malicious input to programs. In theory any program can protect itself if it has a clear specification of what it is to do, or at least a clear concept of what it is not to do. Are there any general ideas here that help, beyond advice to be careful?

I am not versed in any of the schemes described below, but some incarnations use shared runtime software, or automatically compiled code, to parse incoming bit streams safely and thus overcome a category of bug into which buffer overruns fall. Shared software must still be debugged, but the cost is also shared. Browsers have common built-in support for JSON and XML.

These higher level formats address most low level problems of corrupt data. Higher level problems remain.
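As a small illustration of that split, here is a sketch in Python that uses the standard json module as the shared parser. The function name load_request and the rule that "count" must lie between 0 and 100 are assumptions invented for this example. The library rejects malformed bytes, but only the application knows which well-formed values make sense.

```python
import json


def load_request(text):
    """Parse with the shared library, then apply the application's own rules."""
    try:
        request = json.loads(text)  # low level: malformed input is rejected here
    except json.JSONDecodeError as err:
        raise ValueError(f"malformed JSON: {err}") from None
    if not isinstance(request, dict):
        raise ValueError("expected a JSON object")
    count = request.get("count")
    # Higher level: the parser cannot know that a negative count is nonsense.
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(count, bool) or not isinstance(count, int) or not 0 <= count <= 100:
        raise ValueError("count must be an integer from 0 to 100")
    return request
```

Truncated input such as '{"count": 3' fails in the library, while '{"count": -1}' parses cleanly and is caught only by the application's own check.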

Byte codes are input to the JVM program. They may be preprocessed by a byte code verifier, which seems to have a clear charter ruling out direct corruption of the interpreter. Still, the Java language semantics suggest that some programs, expressed as byte codes, will write arbitrary files whose corruption the JVM is likely to be vulnerable to.

A for Andromeda is the highest level malign scenario that I am aware of.

Message Format Specification Languages

ASN.1 is a collection of tools built around a formal means of defining binary data formats which are intended to be used in messages between computer programs. These messages are bit streams and not blocks of RAM. Such messages are normally transmitted over communication links or stored in files.

There is (or could be) a binary form of format spec which would instruct a common runtime library to parse input or marshall output. There is (or could be) a translator from the formal specs to produce code to parse or marshall ASN.1 data. I believe that such tools exist but are not freely available.
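The first idea, a format spec that instructs a common runtime routine, can be sketched in a few lines of Python. This is not ASN.1 and not any real tool's interface. The spec notation (a list of name and type pairs), the type names u8, u16 and bytes, and the functions marshal and parse are all hypothetical, chosen only to show one generic routine doing bounds-checked parsing for every message format.

```python
import struct

# Hypothetical spec: ordered (field name, field type) pairs. Not ASN.1 syntax.
SPEC = [("version", "u8"), ("port", "u16"), ("host", "bytes")]

FIXED = {"u8": ">B", "u16": ">H"}  # big-endian fixed-width types


def marshal(spec, record):
    """Encode a dict as bytes by walking the spec."""
    out = bytearray()
    for name, kind in spec:
        value = record[name]
        if kind == "bytes":
            if len(value) > 255:
                raise ValueError(f"{name}: too long for a one-byte length")
            out += bytes([len(value)]) + value  # one-byte length prefix
        else:
            out += struct.pack(FIXED[kind], value)
    return bytes(out)


def parse(spec, data):
    """Decode bytes into a dict, refusing to read past the end of the input."""
    pos = 0
    record = {}

    def take(n):
        nonlocal pos
        if pos + n > len(data):
            raise ValueError("message truncated")
        chunk = data[pos:pos + n]
        pos += n
        return chunk

    for name, kind in spec:
        if kind == "bytes":
            length = take(1)[0]
            record[name] = take(length)
        else:
            fmt = FIXED[kind]
            (record[name],) = struct.unpack(fmt, take(struct.calcsize(fmt)))
    if pos != len(data):
        raise ValueError("trailing bytes after message")
    return record
```

Every read goes through take, so a message whose length byte claims more data than is present is rejected rather than overrun. The check is written once and shared by every format described in this notation.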

Google’s Protocol Buffers seems to provide what I see of value in ASN.1 with much less conceptual overhead. There is a file format for something like message format declarations. The declarations provide the types of the fields in the message and, if you use their tools to produce access classes for the messages, programmatic names for the fields. The names are not in the messages, which are pure payload. A .proto file together with a conforming message could be turned into XML. I haven’t seen a program to do this, nor a standard mapping.
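To make "the names are not in the messages" concrete, here is a sketch in Python that decodes two of the Protocol Buffers wire types by hand: varints (wire type 0) and length-delimited fields (wire type 2). It does not use Google's library. The NAMES table stands in for what a .proto file would supply, and its field names, the to_xml function and the XML layout are all assumptions, since I know of no standard mapping.

```python
import xml.etree.ElementTree as ET


def read_varint(buf, pos):
    """Read a base-128 varint starting at pos; return (value, new position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return result, pos
        shift += 7
        if shift >= 64:
            raise ValueError("varint too long")


def decode_fields(buf):
    """Return (field number, value) pairs; the wire carries no field names."""
    pos = 0
    fields = []
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            if pos + length > len(buf):
                raise ValueError("length runs past end of message")
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields.append((number, value))
    return fields


# Hypothetical stand-in for the names a .proto file would declare.
NAMES = {1: "id", 2: "label"}


def to_xml(buf, names):
    """Render a message as XML; assumes length-delimited fields hold UTF-8 text."""
    root = ET.Element("message")
    for number, value in decode_fields(buf):
        child = ET.SubElement(root, names[number])
        child.text = value.decode("utf-8") if isinstance(value, bytes) else str(value)
    return ET.tostring(root, encoding="unicode")
```

The twelve bytes 08 96 01 12 07 74 65 73 74 69 6e 67 decode to field 1 with value 150 and field 2 with the bytes of "testing". The words "id" and "label" appear only in the NAMES table, never in the message itself.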

Protocol Buffers is a bit like Apple’s plist format, except that a plist message (file) contains the names and structure along with the payload. The binary plist file is much more compact and faster to access than XML, however. It is not even a string of printable characters. The plutil shell command makes it look as if a plist is compressed XML, except that there are many XML constructs that it rejects.
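Python's standard plistlib module reads and writes both the binary and the XML forms of a plist, which makes the comparison easy to try. In this sketch the record contents are an invented example.

```python
import plistlib

record = {"name": "example", "count": 3}  # hypothetical payload

# The same data written in the two plist encodings.
binary = plistlib.dumps(record, fmt=plistlib.FMT_BINARY)
xml = plistlib.dumps(record, fmt=plistlib.FMT_XML)

# The binary form starts with a magic header and is not printable text,
# yet the field names still travel inside the file.
has_magic = binary.startswith(b"bplist00")
names_inside = b"name" in binary and b"count" in binary

# Both forms decode back to the same structure.
round_trips = plistlib.loads(binary) == record and plistlib.loads(xml) == record

print(len(binary), len(xml), has_magic, names_inside, round_trips)
```

For a record this small the binary form is a fraction of the size of the XML form, most of whose bytes are header and tags.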

See xPL.