Here is a simple introduction of addition latency reduction. Suppose we need to add 64 bit numbers and we have designed circuits to add 32 bit numbers. Divide the two inputs x and y each into two halves:
x = 232xhi + xlow
y = 232yhi + ylow
The new variables are 32 bit numbers. Our circuits produce 33 bit results and can actually add two 32 bit numbers and a one bit number. carryIn is the input carry to the 64 bit task. Consider the C code:
int x0 = xhi+yhi, x1 = xhi+yhi + 1, zlow = xlow+ylow + carryIn;
int zhi = zlow>>32?x1:x0;
Now the 64 bit answer is z = 232(modulo(zhi, 232)) + modulo(zlow, 232) and the 64 bit carry-out is zhi>>32. Observe that in hardware the first line of code can be done concurrently with 3 32 bit adders. The second line of code is merely a select between x0 and x1. The total time is one 32 bit add plus one level of mux. If this trick is done recursively then the time cost is logarithmic in the number of bits. The circuit cost is n∙log(n).

Here is a variation on this idea.