This divide scheme can be combined with carry-save adder (CSA) reasoning to get a further speedup. Each cycle produces log n bits, but as described each cycle must wait for the carry to ripple across the entire ~64-bit word. If one is willing to make do with one bit fewer per cycle, then that many bits are reliably available without waiting for this late carry signal. See This and demo.
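One way to see why giving up a bit can remove the wait is the standard carry-save bound, sketched below. This is not the demo referred to above. It is a minimal illustration that assumes the running value is held as two 64-bit words, `s` and `c`, whose sum is the true value. The names `top_bits_estimate`, `top_bits_exact`, and `check_bound` are hypothetical. Adding only the top `k` bits of each word ignores the carry coming up from the low bits, and that carry is always 0 or 1, so the estimate is never off by more than one unit in its last place.

```python
import random

WIDTH = 64  # assumed word size, matching the ~64-bit ripple mentioned above


def top_bits_estimate(s, c, k, width=WIDTH):
    """Add only the top k bits of each word; no carry from the low bits."""
    shift = width - k
    return (s >> shift) + (c >> shift)


def top_bits_exact(s, c, k, width=WIDTH):
    """Top bits of the fully propagated sum (the slow, full-ripple result)."""
    shift = width - k
    return (s + c) >> shift


def check_bound(trials=10000, k=6, seed=0):
    """Return True if the estimate is always 0 or 1 below the exact value."""
    rng = random.Random(seed)
    for _ in range(trials):
        s = rng.getrandbits(WIDTH)
        c = rng.getrandbits(WIDTH)
        err = top_bits_exact(s, c, k) - top_bits_estimate(s, c, k)
        if err not in (0, 1):
            return False
    return True
```

Calling `check_bound()` exercises the bound on random inputs. The missing carry only affects the lowest of the `k` estimated bits directly, which is consistent with treating one bit per cycle as the price of not waiting for the ripple.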