Here is another stab at the code, following the format B ideas. Note the single line of obscure code devoted to the 4-bit turn ops. Compiled with clang, this code appears to perform routine NewDG in 2.5 ns on a 2.4 GHz CPU, which is only about 6 clock cycles. I think that clang is clever at omitting code that does not contribute to the output, which would explain such a figure, but the assembler listing indicates the code is there and is being executed.
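One way to cross-check a timing like this is a harness that forces the result of every call to matter. The sketch below is only an illustration under stated assumptions: `work` is a hypothetical stand-in (a xorshift step), not NewDG, and `run_chain`, `ns_per_call`, and `sink` are names invented here. Each result feeds the next call, and the final value is stored to a `volatile` variable, so the compiler cannot discard the calls as unused.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict -std= modes */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the routine under test; NOT the real NewDG. */
static uint32_t work(uint32_t x) {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

/* Call work() n times, feeding each result into the next call so that
   every call contributes to the value returned. */
static uint32_t run_chain(uint32_t seed, long n) {
    uint32_t x = seed;
    for (long i = 0; i < n; i++) {
        x = work(x);
    }
    return x;
}

/* A volatile store must be performed, so the chain's result is "used". */
static volatile uint32_t sink;

/* Average wall-clock nanoseconds per call over n chained calls. */
static double ns_per_call(long n) {
    struct timespec t0, t1;
    assert(n > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sink = run_chain(1u, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)n;
}
```

Calling `ns_per_call` with a large count (say, 100 million) and multiplying the result by the clock rate in GHz gives an estimate in cycles per call. Because the calls are chained, this measures latency rather than throughput, and a compiler may still simplify a loop whose seed is a compile-time constant, so taking the seed from run-time input is a reasonable extra precaution. Checking the assembler listing remains the direct confirmation.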