Compiling a Functional Language

This page is in terrible disarray but it is where I collect ideas that allow a compiler and computer to compile languages into which functional programming is creeping. This thesis bears on our question. Algol 68 is already a first class functional language in my opinion except for the clause about not letting the addresses of stack values leak to more permanent storage so as to outlive the allocation of those addresses. That means you don’t have pointers in old frames or the heap pointing into expired frames. Other parts of the language implied either garbage collection, or manual and unsafe freeing. There was no standard given for requesting the freeing of heap data. van der Veer’s compiler detects and enforces those rules; the Cambridge compiler did not. The language would graduate to first class by omitting just two or three short exclusion sentences from its definition. The Royal Establishment (RRE?) had a compiler? that supported the liberalized language.

I imagine here a strategy to use the conventional stack for conventional patterns and use the slower more general alternatives only as necessary. My claim is that the compiler can efficiently tell when the fancy stuff is needed. I have not yet proven that claim, however. See some more A68 detail.

This is more about the nature of the compiled code and machine representation of values, than the structure of the compiler. This is a paper by nearly the same title as this page. That paper describes a uniform way to compile functional languages. I won’t repeat much of that. The tricks I explore here presume to compile conventional program patterns conventionally and resort to the tricks of the paper above only in the cases once considered exotic and excluded for thinking them not useful, or for difficulty in implementation. I want to consider a system that mainly uses the hardware that is conventionally used for the stack, and do novel things only when necessary.

First class functions are insinuating themselves into ordinary languages; there is even an effort to add some sort of lambda construct to Java. Here are three language properties, largely orthogonal, that describe the languages I am concerned with:

Anonymous functions (lambda expressions): as in Scheme’s (lambda (x) (+ x b)), OCaml’s fun x -> x + b, Algol 68’s (INT x)INT: (x + b), JavaScript’s function (x){return x+b;}. These lambda expressions are evaluated to produce a function, just as x + b is an expression that is evaluated to produce a number. In neither case is it necessary to name the value, function or number, in order to use it. Each of these languages included lambda from their beginning.
Lexical binding of free variables: Note that the variable ‘b’ in each of the above lambda expressions is undefined and the expression must appear in a context where b is defined in some outer scope in which the expression is embedded. ‘b’ is said to be free in (INT x)INT: (x + b) but not in (INT b = 42; (INT x)INT: (x + b)). By contrast ‘x’ is bound in both. This is terminology that goes back to the 1940’s lambda calculus and the same notions are seen in Frege’s late 19th century logic. Computer language designers refer to these ideas as ‘lexical scoping’.
Persistence of function values as long as they are needed: Algol 68, PL/I and gcc’s C extension support nested function definitions, but such functions generally become invalid when any scope which defines one of their free variables expires, perhaps when the stack frame where the lambda expression was evaluated is retired as the enclosing function returns. In these languages it is impossible for routine X to return a function if that function refers to a variable local to X.

Actually the tricks I seek here work fine for languages without the first property—lambda expressions, but the attractive patterns that the other two properties allow are more syntactically convenient with lambda expressions.

Apple has a clever new addition to C, the block, that provides function like objects that survive the stack frame where they were created. The life time of a block is explicitly managed like other values in C, C++ or Objective C. Here, however, I implicitly assume garbage collection and syntactic lambda constructs such as the above.

While C lacks lambda expressions, the issues apply equally to C and so I will use C, and gcc’s nested function extensions to illustrate the issues. gcc provides neither persistent functions, nor anonymous functions, but it does provide access to free variables. The syntax is merely marginally more awkward.

The Environ

Before we proceed there is an issue to be settled. With recursive routines several stack frames may exist at once for the same function. It is not entirely obvious which frame a free variable should refer to. I will not go into the confusion in detail here but suffice it to say that Church got it right and all the current functional languages that I know of all do it Church’s way. The Algol 68 language definition, for all its faults, introduced the best terminology and imagery in my mind to make clear what frame is meant by a free variable in a function body. That imagery also generally corresponds to the most efficient way, I think, of implementing the necessary semantics. Algol 68 refers to the environ (short for ‘environment’) that is a conceptual runtime construct that defines the current values of all the free variables. Every evaluation of a language expression, including lambdas, is evaluated against some environ. Environs come and go as the computation proceeds and several may exist at once.

(polishing)

digression

In simple languages the stack reifies the environ and we want to use the stack too when we can. The reification of a function is a pair of words, the first points to the code compiled from the body of the function definition; the second points to the reification of the environ.

The environ packages up pointers to all of the stack frames that were accessible as the lambda expression was evaluated to produce the function. This package and a pointer to the code compiled from the body of the function definition is the machine representation of the function. Actually Algol 68 goes one step further and hints that the environ need only package pointers to the frames ‘necessary’ to the function body.

There are common circumstances where the code compiled from the function body can refer directly to a free variable in its parent’s stack frame. Consider:

double ps(int j, double a[j], double b[j]){
   double ip(double x[j], double y[j]) {int k; double s=0;
     for(k=0; k<j; ++k) s += x[j]*y[k]; return s;}
   return ip(a, b)/(ip(a, a)*ip(b, b));}

‘j’ is free in the inner function ip but defined as a parameter to ps. The inner function ip can access j on the stack just as the outer function could using the stack pointer.

This pattern happens when a function f defined within a function g, is invoked during the same invocation of g as f is invoked. (SIC) I will try to say that more clearly. The text that defines f is included in the text that defines g. g is invoked and during that invocation, f is created and invoked. f is not invoked subsequent to the end of the invocation of g. This useful program, described here has routines 3 deep with several instances of free variables at the bottom.

First class values, such as functions here, can be the value of an expression, of a variable, an argument, or of a parameter. These arrangements allow the function to get to other places where the free variables are not in scope, and other times where the stack frame where the variable once lived has been reallocated. Here is a program written in four languages which exploits a convenient hack to instantiate two generators to be run concurrently. The two instances keep their state in a contrived stack frame that persists during the execution. Hacks like this encourage object oriented languages.

The ‘Alternative’ in the Scheme program avoids the hack and provides a routine, mi that generates generators, two of which are generated to do the computation. That inner anonymous function definition has the free variable a which is defined on the scope of mi. mi returns that function each time so that a refers to different initialized storage—we exploit Scheme’s persistent functions.

Fast-- There is still much background to be covered but I put down the ideas here that stimulated this note, before I forget them! The compiler is aware of each free variable within a function body (such as an ordinary function definition or lambda expression) which is declared as a parameter or variable in some larger enclosing function definition. For each such instance we ask if any of the intermediate functions are prone to escape and:

be accessed while the stack frame remains but the stack pointer is deeper,
or be accessed beyond the duration of the stack frames with the referenced variables, whereupon we reallocate those variables to a heap block and include in the stack frame a pointer to that block. The shape of that block is known to the garbage collector, which also knowns shape of each stack frame.

If the function escapes then those variables remain allocated even when the frame that produced the function expires. The idea is that when the compiler notices that a function value might escape from the duration of the current stack frame, and the function body has free variables that are declared in outer scopes.

There comes a time during compilation when the defining occurrences of variables and the corresponding applied occurrences have been identified. At that point in should be possible to decide for each applied occurrence whether it is possible for the address to escape. There are three possible outcomes:

Temporary Leak: ...

Algol 68 provides specific static rules, enforced by some compilers that prevent such escapes where the function is remembered beyond the duration of the invocation where it was produced, at the expense sometimes of preventing safe usage. If a compiler were to resort to allocating such offending variables on the heap, then we would be done. Distinguishing between the two cases above would be an extra benefit.

Suppose we take Algol 68 and replace LOC’s with HEAP’s after they have been exposed in text such as “INT a” as “REF INT a = LOC INT x := ~”. We need not remove the scope strictures for they all involve LOC. Now we have slowed down all ordinary programs but maintained the meaning of programs and allowed many more. If we remove LOC only in the case where the strictures complain, we will not slow down any extant programs and will have allowed the functional ones. I think it is not that simple.

One more thing: We have not said how the code compiled from the function body with the free variables gets access to them. This has been a task since Algol 60. Such a body may have free variables accessing storage in several different stack frames. Those frames will still be allocated in the Algol, PL/I, Pascal family, but their relative positions are not known at compile time. They are not accessible via the stack pointer. They are naturally chained together, however. This is where the ‘environ’ concept shines, following the Algol 68 definition suggests chaining the ‘necessary parts’ of the environ.

In the Algol 68 abstract execution model each block of code ran in an environ which determined the meaning of all the identifiers. An environ was composed of a locale and another environ. (The locale was the current frame and the other environ was the rest of the stack.) When a lambda expression was evaluated it was expressed as a pointer to the compiled code and a pointer to the environ where

(2.1.1.1.e) An “environ” is either empty, or is composed of an environ and a locale.

Almost: Algol 68 explicitly prohibits function values from escaping the environment where they were created in such a way that they could be used after the stack frame where they were born disappears.

Routine inter in this file and explained here, defines within routine range which accesses B which is defined in inter. This access is possible via the stack pointer. Routine fit within inter also refers to three variables reachable on the stack by direct addressing, since the only references to fit are invocations within inter. Sum s defined in inter, is referred to in routine cntrib

There is another matter that might be raised here. Some languages, such as Algol 68, are careful to specify that in a block that creates a scope for the variables declared therein, the storage for that variable is created anew for each entry. Concomitantly such variables evaporate as control leaves the block, not as a basic rule in the execution model, but as a consequence of rules that ensure that references to those variables do not survive as you leave the block.

C	`{int x = cos(z); res = sqrt(x)+x;}`
gcc C	`({int x = cos(z); sqrt(x)+x;})`
Scheme	`(let ((x (cos z))) (+ (sqrt x) x))`
OCaml	`let x = sin z in (sqrt x) +. x`
OAlgol 68	`(REAL x = sin(z); x + sqrt(x))`

The block in the first line does not return a value because of C. JavaScript seems to lack such a construct but the impending version may include it.

{ref int w[10]; for(int i = 0; i<10; ++i) {int j; w[i] = &j;}}

Should put 10 different addresses in the array w. Gcc and clang each put the same address in each array element. Algol 68 refuses the assignment statement quoting a clause that addresses of variables cannot be stored in places that live longer than the variable. Just such variation is demanded however if the natural syntax of conventional languages are to support functional patterns. It would be natural to produce 10 counting objects in C by writing:

typedef void C(void); typedef int R(void);
typedef struct{C *up; R *rd;} ob;
  ob o[10]; for(int i = 0; i < 10; ++i) {
    int c = 0; void inc(void) {++c;}; int r(void) {return c;};
    o[i] = (ob){inc, r};};

This compiles and runs.

gcc cob.c -Wmost -std=c99 -fnested-functions
./a.out

Unfortunately there is only one counter among the objects. The pattern works in Scheme and OCaml however.

too