I record here an early scheme for solving a large class of differential equations using several real CPUs sharing NUMA RAM.

The Problem

We have a two-dimensional mesh of points with 10 to 20 floating-point values given for each point as an initial configuration. We are to compute the situation at subsequent time steps for the same mesh, given boundary conditions and equations that determine the state of each mesh point as a function of the previous time step's states at that point and its immediate neighbors. In one particular version of the problem, about 5000 mostly floating-point instructions were needed to update one mesh point. There were relatively few conditional branches in that code.
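
The structure of one time step can be sketched as follows. This is a hypothetical illustration, not the original code: a single value per point and a simple five-point average stand in for the 10 to 20 values and roughly 5000 instructions of the real physics, and the names `step` and `fixed_edges` are invented for the example.

```python
import numpy as np

def step(old, boundary_fn):
    """Compute the next time step from the previous one.

    Each interior point's new state is a function of its own old state
    and its four immediate neighbors (here: a simple average)."""
    new = old.copy()                      # boundary points keep old values
    new[1:-1, 1:-1] = 0.2 * (old[1:-1, 1:-1] +
                             old[:-2, 1:-1] + old[2:, 1:-1] +
                             old[1:-1, :-2] + old[1:-1, 2:])
    boundary_fn(new)                      # impose the boundary conditions
    return new

def fixed_edges(mesh):
    # Example boundary condition: hold the edges at their initial
    # values. Nothing to do, since `step` copied the old edges over.
    pass

mesh = np.zeros((8, 8))
mesh[0, :] = 1.0                          # hot top edge
for _ in range(10):
    mesh = step(mesh, fixed_edges)
```

The key property, used by every machine described below, is that the new state of a point depends only on previous-step values, so the points of one time step can be computed in any order.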

History

The first version of this program at Livermore ran on the Univac I with 1000 words of memory. The physics was simpler. A mesh point was defined by 10 scaled fixed-point numbers. One point and its immediate neighbors would fit in memory. The 2D mesh for one time step was stored on a magnetic tape. The next time step was computed by reading the tape (backwards) and writing the result on another tape. The roles of these two tapes alternated for successive time steps. Two more tapes were required since a whole row of mesh points could not be accommodated in memory and each mesh point calculation depended on the two neighboring rows. Those tapes held just one row of the mesh at a time. Only recently did I learn that that program was written by Bryce DeWitt.

The IBM 701 had enough memory (4K words) for a whole row of mesh points, so only two tape drives were required. The calculation was scaled fixed-point binary. The 701 lacked index registers and required self-modifying code to step through an array.

The IBM 704 lost the ability to read tape backwards, so the mesh was divided into two parts on two tapes and one tape rewound while the other served. The calculation was converted to floating point, which was new hardware on the 704. Index registers made the code easier to write and faster to run.

The IBM Stretch (7030) had enough memory to hold the entire mesh in core and this simplified the code again.

When I arrived in Livermore in 1955 the Univac code had been running about one year. I had left Livermore before the next development that I report, but I got the following information verbally from Chuck Leith.

The same class of problem was run on a BBN Butterfly. Each processor held in its local memory a 2D sub-mesh of the large 2D mesh. These sub-meshes overlapped by one row along both horizontal and vertical slices. This was necessary because computing the next time step always depended on values at neighboring points. There was no logical conflict in each CPU computing the next time step for the interior of its own sub-mesh. When they had all finished, they would send messages (copy mesh values) to their neighbors. When all data had been copied, computation could commence again.
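
The two-phase pattern can be sketched in one dimension. This is a simplified illustration under assumptions not in the source: each strip owns a block of rows plus one borrowed copy of a neighbor's edge row, the stencil is a trivial three-point average, and the "messages" are plain array copies done sequentially rather than by separate processors.

```python
import numpy as np

def update_interior(strip):
    # Phase 1: compute the next time step for interior rows only.
    # The edge rows are copies of a neighbor's data and stay stale
    # until the exchange phase.
    new = strip.copy()
    new[1:-1] = (strip[:-2] + strip[1:-1] + strip[2:]) / 3.0
    return new

mesh = np.arange(12, dtype=float) ** 2

# Two strips: `a` owns global rows 0-5 plus a copy of row 6;
# `b` owns global rows 6-11 plus a copy of row 5.
a, b = mesh[0:7].copy(), mesh[5:12].copy()

# Phase 1: both strips update independently -- no logical conflict.
a, b = update_interior(a), update_interior(b)

# Phase 2: exchange the overlap rows ("send messages").
a[-1] = b[1]        # a's copy of row 6 <- b's fresh value
b[0] = a[-2]        # b's copy of row 5 <- a's fresh value
```

Only after both copies land may the next phase-1 computation begin, which is why the real scheme needed a barrier between the compute and copy phases.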

A complication in all of these plans was the gradual replacement of a value for one time step with the same value for the next time step. Sometimes the old value was still needed, and this led to several error-prone code patterns, none of them pretty. There were good techniques for finding these bugs.
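
One way to dodge the hazard entirely, at the cost of a second full copy of the mesh, is the discipline the tape versions imposed by accident: read only from the old buffer, write only to the new one, and swap roles between steps. A hypothetical sketch (the function name and stencil are invented for illustration):

```python
import numpy as np

def run(mesh, steps):
    """Advance `mesh` by `steps` time steps using two full buffers.

    Reads come only from `old` and writes go only to `new`, so an
    old value can never be clobbered while still needed. Updating
    in place instead would mix time steps: by the time point (i, j)
    is computed, its left and upper neighbors already hold
    next-step values."""
    old, new = mesh.copy(), mesh.copy()
    for _ in range(steps):
        new[1:-1, 1:-1] = 0.25 * (old[:-2, 1:-1] + old[2:, 1:-1] +
                                  old[1:-1, :-2] + old[1:-1, 2:])
        old, new = new, old               # swap roles, no data copied
    return old

m = np.zeros((6, 6))
m[0, :] = 1.0                             # fixed hot top edge
result = run(m, 5)
```

On the Univac I the two buffers were literally the two main tapes; with memory-resident meshes the swap is just an exchange of pointers, so the only real cost is the doubled storage.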