Having recently come across the Charles Babbage Institute web page and thence the Cray Research Virtual Museum (now here), I decided to jot down some notes from memory here about the insides of some of those computers.
Here is more good information on the subject.
When a programmer thinks of a computer s/he has known, it is likely in terms of bits per word rather than the color of the main frame.
In those days compilers did not insulate the programmer from such details.
Here are some interesting machines from other manufacturers.
I use the ambiguous gender purposefully here.
In the 50’s and 60’s many important contributors to the new art were women, both in hardware and software.
Often the women were not as well known.
CDC 1604
The 1604 was the first of Cray’s machines that I became aware of.
The machine had 32K of 48 bit words.
It had 48 bit floating point, two 48 bit registers in the classic style and six 15 bit index registers.
Here is Seymour’s diagram.
The machine was one’s complement.
There were two instructions per word.
Here is the programming manual.
The machine supported interrupts with the following wart: Upon interrupt the hardware reported the address of the instruction being interrupted.
That address identified only the word holding the instruction.
The hardware remembered whether the first instruction of that word had been executed.
This memory was inaccessible except to the mechanism that resumed after interrupts.
This made nested interrupts or switching contexts upon interrupt excessively arcane.
This is about the only Cray wart that I recall.
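The wart can be made concrete with a small Python model. This is only a sketch under stated assumptions: the class `Cpu1604Sketch`, its field names, and the representation of the hidden memory as a single boolean are illustrative and are not taken from the hardware documentation.

```python
from dataclasses import dataclass

ADDR_MASK = 0o77777  # 15 bit word addresses (32K words)


@dataclass
class Cpu1604Sketch:
    """Hypothetical model: each 48 bit word holds two instructions."""
    word_addr: int = 0              # what an interrupt handler is told
    second_half_next: bool = False  # hidden: first instruction already executed?

    def step(self):
        """Execute one instruction and advance to the next half word."""
        if self.second_half_next:
            self.word_addr = (self.word_addr + 1) & ADDR_MASK
        self.second_half_next = not self.second_half_next

    def interrupt(self):
        """Return only the word address, as the text describes."""
        return self.word_addr


def visible_states(steps):
    """List (reported address, hidden bit) before each of `steps` instructions."""
    cpu = Cpu1604Sketch()
    seen = []
    for _ in range(steps):
        seen.append((cpu.interrupt(), cpu.second_half_next))
        cpu.step()
    return seen


print(visible_states(4))
# [(0, False), (0, True), (1, False), (1, True)]
```

Two consecutive interrupt points report the same word address and differ only in the hidden bit, so a handler that wanted to save one context and resume another had nothing it could store to tell them apart.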
I think that “channels” kept their state in low core.
The state was word count and word address.
CDC 160A
After the 1604 arrived we got a 160A.
(Here too.)
I subsequently learned that Seymour had used the 160 as a test bed for 1604 ideas.
Indeed there were many similarities.
Others report that the first 160 was built from reject transistors unsuitable for their original analog purpose.
In any case the machine was extremely elegant in its simplicity.
The programmer could always find some quick way to do something simple but usually there was just one such way.
This orthogonality extended to the 6600 PPUs.
The 160A had 12 bit words and a 12 bit one’s complement accumulator but no multiply or divide.
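For readers who have not met one's complement, here is a minimal Python sketch of 12 bit arithmetic in that representation. It uses the textbook end-around-carry formulation; it is not claimed to match the actual 160A adder, and the helper names are made up for the illustration.

```python
MASK12 = 0o7777  # twelve one bits


def neg(x):
    """One's complement negation: flip all 12 bits."""
    return ~x & MASK12


def add(a, b):
    """Add two 12 bit values, wrapping any carry out back into bit 0."""
    s = a + b
    if s > MASK12:
        s = (s & MASK12) + 1  # end-around carry
    return s & MASK12


def to_int(x):
    """Interpret a 12 bit pattern as a signed Python integer."""
    return x if x < 0o4000 else -neg(x)


print(to_int(add(5, neg(3))))   # 2
print(to_int(add(3, neg(5))))   # -2
print(oct(add(5, neg(5))))      # 0o7777, the "negative zero" pattern
```

Negation is a plain bit flip, which is part of the appeal; the cost is that zero has two representations.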
There were no interrupts(?).
There were two banks of 4K words each and some sort of “small address hack”.
There was a full complement of instructions that indirected thru low core.
There were also two word instructions that included a displacement to be added to the low core content to form the effective address.
This was almost as good as index registers.
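The two addressing forms just described can be sketched as follows. This is a guess at the general shape only: the function names, the argument `d` for the low core cell, and the simple wraparound at 12 bits are assumptions made for the illustration, not details from a 160A manual.

```python
MASK12 = 0o7777


def ea_indirect(core, d):
    """Effective address is the content of low core cell d."""
    return core[d] & MASK12


def ea_indirect_displaced(core, d, displacement):
    """Two word form: the second word's displacement is added to cell d."""
    return (core[d] + displacement) & MASK12  # wraparound is an assumption


core = [0] * 4096
core[0o10] = 0o1000                 # cell 10 (octal) points at a table
print(oct(ea_indirect(core, 0o10)))               # 0o1000
print(oct(ea_indirect_displaced(core, 0o10, 5)))  # 0o1005
```

Stepping through a table then amounts to incrementing the low core cell, which is the sense in which the scheme approximates an index register.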
The 160A had several unusual IO devices attached and served at Livermore for media conversion.
Another use of the 160A was to emulate the 6600 PPUs which also had 12 bit words.
In this case the small address hack worked to the advantage of the application as the second bank was used exactly to model the 4K words of the PPU.
There was a political aspect to the fact that the machine came in the form of a desk at which the programmer could comfortably sit and debug a program.
Other machines were so expensive as to presumably warrant a style of debugging where the programmer would spend hours at his (non-computer) desk to solve a mystery that might have been solved in minutes with hands-on access to the computer.
This was the milieu leading up to timesharing.
CDC 3600
This was not designed by Cray but it was (sort of) upwards compatible with the 1604.
The data formats and addressing patterns were compatible.
The addresses were still 15 bits with some small address hack allowing access to 2^18 words.
The machines had some sort of flexible configuration ability that allowed a pool of such machines to dynamically (with the help of an ever present operator) reallocate a bank of memory from one machine to another.
Livermore’s compiler group had their first real success with the 3600 machine, as I recall.
The 3600 was a cautious extension of the 1604 but it was rather useful.
See Music for the 3600.
CDC 6600
The 6600 was the first machine to be delivered that I know of that meets the common definition of RISC.
The program specifically arranged for concurrent execution of instructions.
It had eight 60 bit X registers holding either floating or fixed (one’s complement) words, 8 A registers and 8 B registers each holding 18 bit values.
When an A register was loaded with an address, the corresponding X register would be loaded from that address (or stored if it was A6 or A7).
There were 15 and 30 bit instructions packed in the 60 bit memory words.
15 bit arithmetic instructions referred to three registers.
Some 15 bit instructions caused loads or stores at addresses specified by A or B registers.
The 30 bit instructions included an 18 bit displacement and caused loads or stores.
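The coupling between A and X registers is easy to show in code. The following Python sketch is illustrative only: the class and method names are invented, memory is a dictionary for brevity, and treating A0 as having no memory side effect goes beyond what the text above states.

```python
MASK18 = (1 << 18) - 1
MASK60 = (1 << 60) - 1


class Cpu6600Sketch:
    """Hypothetical register file with the A-to-X load/store side effect."""

    def __init__(self):
        self.mem = {}        # address -> 60 bit word, unset cells read as 0
        self.X = [0] * 8     # 60 bit operand registers
        self.A = [0] * 8     # 18 bit address registers
        self.B = [0] * 8     # 18 bit index registers

    def set_a(self, i, address):
        """Setting Ai also moves a word between memory and Xi."""
        self.A[i] = address & MASK18
        if 1 <= i <= 5:
            self.X[i] = self.mem.get(self.A[i], 0)    # load
        elif i >= 6:
            self.mem[self.A[i]] = self.X[i] & MASK60  # store
        # i == 0: assumed to have no memory side effect


cpu = Cpu6600Sketch()
cpu.mem[0o100] = 42
cpu.set_a(1, 0o100)      # loads X1 from address 100 (octal)
cpu.X[6] = cpu.X[1] + 1
cpu.set_a(6, 0o200)      # stores X6 at address 200 (octal)
print(cpu.X[1], cpu.mem[0o200])  # 42 43
```

There are no separate load and store opcodes in this model; memory traffic is a by-product of address arithmetic.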
I quote from the programming manual here:
- An instruction is issued to a functional unit when
  - The specified functional unit is not reserved.
  - The specified result register is not reserved for a previous result.
- Instructions are issued to functional units at minor cycle intervals when no reservation conflicts (above) are present.
- Instruction execution starts in a functional unit when both operands are available (execution is delayed when an operand (s) is a result of a previous step which is not complete).
- No delay occurs between the end of a first unit and the start of a second unit which is waiting for the results of the first.
- No instructions are issued after a branch instruction until the branch instruction has been executed.
  The branch unit uses
  - An increment to form the go to K+Bi and go to K if Bi … instructions, or
  - The long add unit to perform the go to K if Xi … instructions
  in the execution of a branch instruction.
  The time spent in the long add or increment units is part of the total branch time.
- Read central memory access time is computed from end of increment unit time to the time operand is available in X operand register.
Minimum time is 500 ns assuming no central memory bank conflict.
An additional limitation was that 60 bit instruction words could be fetched at a maximum rate of one per 800 ns.
Loops confined to two instruction words (up to eight instructions) were not limited by this.
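The arithmetic behind these two remarks can be sketched in Python. The packing function is a simplified illustration: it assumes an instruction never straddles a word boundary, which the text above does not state, and it ignores whatever padding the real assembler inserted. The function names are invented.

```python
WORD_BITS = 60
FETCH_NS = 800   # maximum instruction word fetch rate, from the text
ISSUE_NS = 100   # one minor cycle


def pack(lengths):
    """Greedily pack 15 and 30 bit instructions into 60 bit words."""
    words, current, used = [], [], 0
    for n in lengths:
        if n not in (15, 30):
            raise ValueError("instructions are 15 or 30 bits")
        if used + n > WORD_BITS:   # assumed: no straddling of words
            words.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        words.append(current)
    return words


def straight_line_ns_per_instruction(per_word):
    """Average time per instruction when every word must be fetched."""
    return max(FETCH_NS / per_word, ISSUE_NS)


print(len(pack([15] * 8)))                  # 2 words hold eight short instructions
print(pack([15, 30, 30, 15]))               # [[15, 30], [30, 15]]
print(straight_line_ns_per_instruction(4))  # 200.0
```

Under these assumptions straight-line code of short instructions averages 200 ns per instruction, twice the 100 ns issue interval, while a loop that fits in two words escapes the fetch limit.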
There were two multiply functional units with 1000 ns latency.
There was one divide unit.
The floating add unit took 400 ns, yielding an unnormalized sum which could be normalized by a subsequent instruction.
Livermore found that most floating adds needed to be normalized.
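The issue rules quoted from the manual, combined with the latencies just mentioned, can be turned into a toy scheduler. This Python sketch is a deliberate simplification: it models only the two issue conditions and the wait for operands, takes a minor cycle to be the 100 ns clock, ignores memory, branches, and any hazard the quoted rules do not mention, and the unit names `mul1`, `mul2`, and `add` are invented labels.

```python
def schedule(program, latency):
    """Return (issue, start, finish) in minor cycles for each instruction.

    program: list of (unit, dest, src1, src2) in program order.
    latency: dict mapping unit name to its latency in minor cycles.
    """
    unit_free = {}   # unit -> cycle when its reservation clears
    reg_ready = {}   # register -> cycle when its pending result arrives
    next_issue = 0
    timeline = []
    for unit, dest, src1, src2 in program:
        # Issue waits for the unit and for the result register to be free.
        issue = max(next_issue, unit_free.get(unit, 0), reg_ready.get(dest, 0))
        # Execution waits until both operands are available.
        start = max(issue, reg_ready.get(src1, 0), reg_ready.get(src2, 0))
        finish = start + latency[unit]
        unit_free[unit] = finish
        reg_ready[dest] = finish
        next_issue = issue + 1   # at most one issue per minor cycle
        timeline.append((issue, start, finish))
    return timeline


latency = {"mul1": 10, "mul2": 10, "add": 4}   # 1000 ns and 400 ns
program = [
    ("mul1", "X0", "X1", "X2"),
    ("mul2", "X3", "X4", "X5"),
    ("add",  "X6", "X0", "X3"),   # waits for both products
    ("add",  "X7", "X1", "X2"),   # waits for the add unit itself
]
print(schedule(program, latency))
# [(0, 0, 10), (1, 1, 11), (2, 11, 15), (15, 15, 19)]
```

The third instruction issues at once but sits in the add unit until cycle 11, and the fourth cannot issue until cycle 15 even though its operands were ready from the start, because the single add unit is reserved.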
Central memory was 32 boxes each of 4K 60 bit words for a total of 2^17 words.
The cycle time was 1 microsecond and the memory bus cycle time was 100 ns like the clock of the rest of the machine.
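A short Python sketch shows why 32 independent boxes matter when each has a 1 microsecond cycle but the bus can start an access every 100 ns. Selecting the box from the low-order address bits is an assumption of this sketch, as are the in-order accesses and the function name.

```python
BANKS = 32
BANK_WORDS = 4096
BANK_CYCLE_NS = 1000
BUS_CYCLE_NS = 100

assert BANKS * BANK_WORDS == 2 ** 17   # total central memory words


def access_times(addresses):
    """Start time in ns of each access, issued strictly in order."""
    bank_free = [0] * BANKS
    t = 0
    starts = []
    for addr in addresses:
        bank = addr % BANKS            # assumed low-order interleave
        start = max(t, bank_free[bank])
        bank_free[bank] = start + BANK_CYCLE_NS
        starts.append(start)
        t = start + BUS_CYCLE_NS
    return starts


print(access_times([0, 1, 2, 3]))   # [0, 100, 200, 300]
print(access_times([0, 32, 64]))    # [0, 1000, 2000]
```

Consecutive addresses land in different boxes and proceed at bus speed; a stride of 32 hits the same box every time and runs at the core cycle time.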
The Peripheral Processing Units were novel in that the ten of them shared execution hardware.
I recall John Cocke broaching this idea about 1961 within IBM but there was no hardware built then.
Each processor had virtually an 18 bit accumulator, a 12 bit instruction address, and a three bit instruction phase register that was not really visible to the programmer.
Each processor had its dedicated box of 4K 12 bit words.
There were 12 “channels” each with its own 12 bit data register that could either drive or be driven by its own external cable.
One PPU instruction was able to either send or receive a block of 12 bit words over one of these channels at up to one word per microsecond.
While this instruction was executing, the instruction counter was stored in location 0 in core while the accumulator remembered the core address for the next word to be transferred.
The erstwhile program counter counted the remaining words.
There was also shared hardware allowing a PPU to move data between central memory and the PPU’s private memory.
Operating systems dynamically allocated these PPUs to the tasks at hand.
IO devices were permanently attached to some cable which in turn was permanently attached to one of these channels.
The instruction set was reminiscent of the 160A computer, except for the 18 bit accumulator.
A particular PPU would use the shared execution hardware for 100 ns of each microsecond.
The system logic clock was 100 ns.
Instructions typically took 1, 2 or 3 microseconds, and just as many core cycles.
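The time slicing of the shared execution hardware can be sketched in a few lines of Python. The fixed rotation order and the function names are assumptions of the sketch; only the ten processors and the 100 ns slot per microsecond come from the text.

```python
SLOT_NS = 100
PPUS = 10


def owner_of_slot(t_ns):
    """Which PPU holds the shared hardware at time t_ns."""
    return (t_ns // SLOT_NS) % PPUS


def slots_for(ppu, count):
    """Start times in ns of the first `count` slots granted to `ppu`."""
    return [(ppu + k * PPUS) * SLOT_NS for k in range(count)]


print(slots_for(3, 3))       # [300, 1300, 2300]
print(owner_of_slot(1300))   # 3
```

Each PPU sees the hardware once per microsecond, which matches its core cycle, so from the programmer's side each one looks like a complete, if slow, computer.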
The hardware provided no semblance of interrupts for the PPUs; the PPU program had to maintain vigilance over the I/O operations that it was responsible for.
There was no DMA like function except that the PPU program could devote its PPU to moving data between the I/O device and its own core box for large blocks of data.
Such a transfer could reach a rate of one 12 bit word per μsec.
A PPU could, of course, read and write its own 1 μsec core box.
The PPU could read a 12 bit clock that incremented each microsecond.
There were PPU ops to move blocks of PPU core to or from shared central CPU memory.
The PPU program specified both memory addresses.
The data rate was one 12 bit word per μsec.
This data path was shared between PPUs and the aggregate rate was 60 bits per μsec in each direction.
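Since a central memory word is 60 bits and a PPU word is 12, five PPU words make one central word, and the 60 bits per μsec aggregate corresponds to five PPUs each moving 12 bits per μsec. A minimal Python sketch of the packing follows; placing the first PPU word in the high-order position is an assumption, and the function names are invented.

```python
MASK12 = 0o7777


def pack60(words12):
    """Combine five 12 bit words into one 60 bit word, first word highest."""
    if len(words12) != 5:
        raise ValueError("exactly five 12 bit words are needed")
    value = 0
    for w in words12:
        value = (value << 12) | (w & MASK12)
    return value


def unpack60(value):
    """Split a 60 bit word back into five 12 bit words."""
    return [(value >> shift) & MASK12 for shift in (48, 36, 24, 12, 0)]


words = [1, 2, 3, 4, 0o7777]
print(unpack60(pack60(words)) == words)   # True
print(60 // 12)                           # 5
```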
A PPU could read the CPU’s instruction counter.
It could perform an “exchange jump” on the CPU specifying a central memory address.
This would cause the CPU to stop and deposit its state in central memory and at the same time pick up a new state from the same locations (using split cycles), and then resume execution.
This state included the program visible state and also a bounds and offset in central memory that modified each access to central memory by the central CPU’s program.
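The exchange jump and the bounds and offset mechanism can be sketched together in Python. Everything about the layout here is hypothetical: the four word package `[P, offset, bounds, X0]` is far smaller than the real one, the class and method names are invented, and raising an exception stands in for whatever the hardware actually did on an out-of-bounds reference.

```python
STATE_WORDS = 4   # hypothetical package: [P, offset, bounds, X0]


class CentralSketch:
    def __init__(self, size):
        self.mem = [0] * size
        self.state = [0] * STATE_WORDS

    def exchange_jump(self, addr):
        """Swap the running state with the package at absolute address addr."""
        for i in range(STATE_WORDS):
            self.state[i], self.mem[addr + i] = self.mem[addr + i], self.state[i]

    def translate(self, program_addr):
        """Map a program address to an absolute one, checking the bounds."""
        _, offset, bounds, _ = self.state
        if not 0 <= program_addr < bounds:
            raise IndexError("address outside bounds")
        return offset + program_addr

    def load(self, program_addr):
        return self.mem[self.translate(program_addr)]


m = CentralSketch(64)
m.mem[0:4] = [5, 32, 8, 99]   # a waiting job: P=5, offset=32, bounds=8
m.mem[32 + 3] = 123           # word 3 of that job's region
m.exchange_jump(0)            # the PPU switches the CPU to that job
print(m.state, m.load(3))     # [5, 32, 8, 99] 123
```

Because the old state lands in the same locations the new state came from, a second exchange jump to the same address switches back, and the running program never sees absolute addresses.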
Many details,
Seymour’s interlocks