I am a rank amateur when it comes to circuits, but I recall this idea, which will seem nearly obvious to some yet is in need of recording. How do you do a lot of multiplying, say 53 bits times 53 bits? I assume carry-save addition (CSA) at the outset. That page also describes a scheme with which this idea is compatible, and which provides a factor of three over the naïve multiplier scheduling. CSA costs two bits per bit. I double that again, to 4 bits per bit, by computing each signal and its negation. Each of these signals drives just 4 legs of an or, and, nor or nand; which of these depends mainly on the chosen circuit family. I think it is obvious that none of these signals travels far.
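As a reminder of what CSA does at the word level, here is a minimal Python sketch. The name carry_save_add is mine, and Python's unbounded integers stand in for the wires; it illustrates the arithmetic only, not any particular circuit.

```python
def carry_save_add(a, b, c):
    """Reduce three non-negative integers to two with the same total.

    Each bit position is handled independently: no carry propagates,
    which is why one CSA stage costs only a couple of logic levels.
    """
    sum_bits = a ^ b ^ c                              # per-bit sum (parity)
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, moved up one place
    return sum_bits, carry_bits


s, k = carry_save_add(13, 7, 9)
assert s + k == 13 + 7 + 9
```

Three numbers go in and two come out; the expensive carry-propagating add is put off until only two numbers remain.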

For a 64 by 64 bit multiply we must do the following 22∙64 times:

We assume that each input arrives as both a signal and its negation, and we must produce both forms of each output as well, thus:
           A⊕B⊕C  = A∧B∧C ∨ A∧B̅∧C̅ ∨ A̅∧B∧C̅ ∨ A̅∧B̅∧C
         ¬(A⊕B⊕C) = A∧B∧C̅ ∨ A∧B̅∧C ∨ A̅∧B∧C ∨ A̅∧B̅∧C̅
  A∧B ∨ A∧C ∨ B∧C = A∧B∧C ∨ A̅∧B∧C ∨ A∧B̅∧C ∨ A∧B∧C̅
¬(A∧B ∨ A∧C ∨ B∧C)= A∧B̅∧C̅ ∨ A̅∧B∧C̅ ∨ A̅∧B̅∧C ∨ A̅∧B̅∧C̅
‘⊕’ is exclusive-or, and A̅ means ¬A, that is, ‘not A’. Among the 16 ‘and’ groupings there are just 8 distinct subexpressions, each used twice.
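The cell can be checked exhaustively. The following Python sketch is my own illustration (the function name and the 0/1 encoding of signals are assumptions, not any real netlist): it builds the 8 shared ‘and’ terms once, feeds each into two of the four ‘or’ gates, and compares the outputs with the definitions of sum and carry for all eight input combinations.

```python
from itertools import product


def full_adder_dual_rail(a, na, b, nb, c, nc):
    """Two-level and/or full adder on dual-rail inputs; no inverter is used.

    Inputs are 0/1 values; na, nb, nc are the supplied negations of a, b, c.
    Returns (sum, not_sum, carry, not_carry).
    """
    # First level: the 8 three-input 'and' terms, indexed by the
    # (A, B, C) pattern each one recognizes. Every input rail feeds 4 of them.
    m = {
        (x, y, z): (a if x else na) & (b if y else nb) & (c if z else nc)
        for x, y, z in product((0, 1), repeat=3)
    }
    # Second level: four 4-input 'or' gates. Every 'and' term feeds 2 of them.
    s = m[1, 1, 1] | m[1, 0, 0] | m[0, 1, 0] | m[0, 0, 1]    # odd number of ones
    ns = m[1, 1, 0] | m[1, 0, 1] | m[0, 1, 1] | m[0, 0, 0]   # even number of ones
    k = m[1, 1, 1] | m[0, 1, 1] | m[1, 0, 1] | m[1, 1, 0]    # two or more ones
    nk = m[1, 0, 0] | m[0, 1, 0] | m[0, 0, 1] | m[0, 0, 0]   # one or fewer ones
    return s, ns, k, nk


for a, b, c in product((0, 1), repeat=3):
    s, ns, k, nk = full_adder_dual_rail(a, 1 - a, b, 1 - b, c, 1 - c)
    assert s == a ^ b ^ c and ns == 1 - s
    assert k == (1 if a + b + c >= 2 else 0) and nk == 1 - k
    assert s + ns + k + nk == 2   # exactly two of the four outputs are up
```

The last assertion is the property mentioned for the loads: whatever the inputs, exactly one of each output pair is high.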

These claims hold whatever the bottom-turtle circuits: nand, nor, or and-plus-or. No ‘not’ gate is needed.

This is just two logic levels in any circuit family, with near-ideal fan-in and fan-out. I think the (ECL) loads people were happy because exactly two of the four output signals are up at a time.

The latency estimate assumes a clock or two to compute 3 times the multiplicand and simultaneously derive the carries saved in this scheme. Then there are 18 numbers to be summed to produce the product. Using CSA we reduce this count thus:

⌈53/3⌉ = 18 → 12 → 8 → 6 → 4 → 3 → 2 ⟶ 1
Each short arrow is two logic levels as above. The long arrow is a full add. Subsequently there is a contingent one bit shift. It is easy to recycle some of this hardware to reduce gate count and increase latency and thruput. Wires seem short here, fan-in and out are nearly ideal. John Cocke suggested double clocking this part of the ACS machine to get twice the work out of these gates.
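The count sequence can be reproduced with a small Python model. This is a sketch under loud assumptions: the partial products here are plain radix-8 digits times the multiplicand, which is a simplification and not the scheme with 3 times the multiplicand described above, and the names reduce_level and multiply are hypothetical. It ignores the contingent shift and models only the summing.

```python
def carry_save_add(a, b, c):
    """Three numbers in, two numbers out, same total, no carry propagation."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def reduce_level(terms):
    """One short arrow: every full group of three becomes two; leftovers pass through."""
    full = len(terms) - len(terms) % 3
    out = []
    for i in range(0, full, 3):
        out.extend(carry_save_add(*terms[i:i + 3]))
    out.extend(terms[full:])
    return out


def multiply(x, y, bits=53):
    """Multiply by summing radix-8 partial products through a CSA tree.

    Assumption: each partial product is simply (3-bit digit of y) * x,
    shifted into place. Returns the product and the term count per level.
    """
    terms = [(((y >> i) & 7) * x) << i for i in range(0, bits, 3)]
    counts = [len(terms)]
    while len(terms) > 2:
        terms = reduce_level(terms)
        counts.append(len(terms))
    return terms[0] + terms[1], counts   # the long arrow: one full add


x, y = (1 << 53) - 1, (1 << 53) - 3
p, counts = multiply(x, y)
assert p == x * y
print(counts)  # [18, 12, 8, 6, 4, 3, 2]
```

That is six short arrows, or twelve logic levels, ahead of the one carry-propagating add.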

See “A∧B̅∧C̅” in this page about lines above letters.