I record here an early scheme for solving a large class of differential equations using several real CPUs in shared NUMA RAM.
The IBM 701 had enough memory (4K words) for a whole row of mesh points, and only two tape drives were required. The calculation was done in scaled fixed-point binary. The 701 lacked indexing and required self-modifying code to pass over an array.
The IBM 704 lost the ability to read tape backwards, so the tape content was divided into two parts on two tapes, and one tape rewound while the other served. The calculation was converted to floating point, which was new hardware on the 704. Index registers made the code easier to write and faster to run.
The IBM Stretch (7030) had enough memory to hold the entire mesh in core and this simplified the code again.
When I arrived in Livermore in 1955 the Univac code had been running for about one year. I had left Livermore before the next development that I report, but I got the following information verbally from Chuck Leith.
The same class of problem was run on a BBN Butterfly. Each processor held in its local memory a 2D sub-mesh of the large 2D mesh. These sub-meshes overlapped by one row along both horizontal and vertical slices. This was necessary because computing the next time step always depended on values at neighboring points. There was no logical conflict in having each CPU compute the next time step for the interior of its own sub-mesh. When they had all finished, they would begin to send messages (copy mesh values) to their neighbors. When all the data had been copied, the computation could commence again.
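The compute-then-copy cycle can be sketched in a few lines of Python. This is only an illustration under assumptions of my own, not the Butterfly code: the update rule is a simple four-neighbor average, the outer edge of the mesh is held fixed, each tile carries a one-cell ring of copied neighbor values, and the tiles run one after another rather than on separate processors. All function names are hypothetical.

```python
def make_mesh(n, m):
    """An arbitrary n x m test mesh (values are illustrative only)."""
    return [[float((3 * i + 7 * j) % 11) for j in range(m)] for i in range(n)]

def step_global(u):
    """Reference: one averaging step on the whole mesh, outer edge held fixed."""
    new = [row[:] for row in u]
    for i in range(1, len(u) - 1):
        for j in range(1, len(u[0]) - 1):
            new[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1])
    return new

def split(u, P, Q):
    """Cut the interior of u into P x Q tiles, each padded by a one-cell ring."""
    n, m = len(u), len(u[0])
    rows = [1 + (n - 2) * p // P for p in range(P + 1)]
    cols = [1 + (m - 2) * q // Q for q in range(Q + 1)]
    tiles = {}
    for p in range(P):
        for q in range(Q):
            tiles[p, q] = [u[r][cols[q] - 1:cols[q + 1] + 1]
                           for r in range(rows[p] - 1, rows[p + 1] + 1)]
    return tiles, rows, cols

def exchange(tiles, P, Q):
    """Copy edge values between neighboring tiles (the 'messages')."""
    for (p, q), t in tiles.items():
        if p + 1 < P:                      # neighbor below
            below = tiles[p + 1, q]
            below[0][1:-1] = t[-2][1:-1]   # my last owned row -> its top ring
            t[-1][1:-1] = below[1][1:-1]   # its first owned row -> my bottom ring
        if q + 1 < Q:                      # neighbor to the right
            right = tiles[p, q + 1]
            for i in range(1, len(t) - 1):
                right[i][0] = t[i][-2]
                t[i][-1] = right[i][1]

def run_tiled(u, P, Q, steps):
    """Alternate a compute phase on every tile with a copy phase."""
    tiles, rows, cols = split(u, P, Q)
    for _ in range(steps):
        # Each tile only reads its own storage, so no tile conflicts with another.
        tiles = {k: step_global(t) for k, t in tiles.items()}
        exchange(tiles, P, Q)
    out = [row[:] for row in u]            # reassemble the owned cells
    for (p, q), t in tiles.items():
        for i in range(1, len(t) - 1):
            out[rows[p] - 1 + i][cols[q]:cols[q + 1]] = t[i][1:-1]
    return out
```

Calling `run_tiled(make_mesh(8, 10), 2, 3, 5)` should give the same mesh as five applications of `step_global`, since each tile sees exactly the neighbor values the whole-mesh computation would have seen.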
A complication in all of these plans was the gradual replacement of each value for one time step with the corresponding value for the next time step. Sometimes the old value was still needed, and this led to several error-prone code patterns, none of them pretty. There were good techniques for finding these bugs.
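A small Python sketch shows the kind of trouble meant here. It assumes a one-dimensional mesh and a neighbor-averaging update, neither of which comes from the original codes, and the function names are hypothetical. The first version keeps two full copies, the second overwrites in place and wrongly reads a value that has already been replaced, and the third overwrites in place but saves the one old value that is still needed.

```python
def step_two_buffers(u):
    """Safe but memory-hungry: old and new time steps kept separately."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new

def step_in_place_buggy(u):
    """Overwrites as it goes; u[i-1] already holds the NEXT time step."""
    for i in range(1, len(u) - 1):
        u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u

def step_in_place_saved(u):
    """Overwrites as it goes, but carries the old left neighbor along."""
    prev_old = u[0]
    for i in range(1, len(u) - 1):
        old = u[i]                       # still needed by the next point
        u[i] = 0.5 * (prev_old + u[i + 1])
        prev_old = old
    return u
```

Starting from `[0.0, 0.0, 4.0, 0.0, 0.0]`, the first and third versions agree, while the buggy one smears the new value at point 1 into points 2 and 3.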