Here is another stab at code following the format B ideas.
Note the one line of obscure code devoted to the 4-bit turn ops.
This code, compiled with clang, appears to perform routine NewDG in 2.5 ns on a 2.4 GHz CPU, which is about 6 clock cycles.
I know that clang is clever at omitting code that does not contribute to the output, but the assembler listing indicates that the code is there and is being performed.
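
A common way to guard a timing loop against that kind of omission is to make the result observable through a volatile object. The following is only a minimal sketch under stated assumptions, not the code measured above: work is a hypothetical stand-in for the routine being timed (it is not NewDG), and seed, sink and ns_per_call are names invented for this illustration. It assumes a POSIX system that provides clock_gettime.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the routine under test; NOT the real NewDG. */
static uint32_t work(uint32_t x) {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

/* Volatile accesses are observable behavior, so the compiler must keep them.
   Reading the seed through a volatile stops constant folding of the loop;
   writing the result to a volatile keeps the loop's result from being dead. */
static volatile uint32_t seed = 2463534242u;
static volatile uint32_t sink;

/* Average wall-clock nanoseconds per call over `iterations` calls. */
static double ns_per_call(long iterations) {
    struct timespec t0, t1;
    uint32_t x;

    assert(iterations > 0);
    x = seed;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iterations; i++) {
        x = work(x);            /* each call depends on the previous result */
    }
    sink = x;                   /* forces the final value to be computed */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
            (double)(t1.tv_nsec - t0.tv_nsec)) / (double)iterations;
}
```

Calling ns_per_call(100000000) from main and printing the result gives a per-call figure that can be compared with the clock period. Checking the assembler listing is still worthwhile, since the volatile store only guarantees that the final value is computed, not that the loop body was left in its original form.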