It occurred to me that SHA-3 being a cryptographic hash, it is one of those peculiar bits of code that are fully self-testing. Any bug in a cryptographic hash will quickly cascade to a different result, no matter the bug or the input.
This means the ad-hoc-compiler-monkey can be unleashed “safely”, and can be allowed to try “improper changes.”
The optimizations in the previous ad hoc SHA-3 compiler  were rather tame (and cute), they only tried “safe” changes, that would not alter correct code behavior. Of course that severely restricted what it could do, leading to “an optimum given the current compiler limitations” as noted at the end of the article.
The new optimizer is more like a rampaging monkey, it is allowed to try:
- any permutation in the Pascal source code
- any permutation in the compiled asm code
Yes, even permutations that break correct behavior.
And it also gained a bit of creativity: when it encounters an instruction that involves a memory read access, it is allowed to insert a read access of that memory to a register (which it chooses at random). The greedy optimizer is then invoked to replace subsequent memory accesses by register accesses. Most of the time, this will trash another register and result in incorrect code. But that’s fine.
Previously the optimization goal was just to reduce the number of memory accesses, it was changed to reducing the number of memory accesses and execution time.
So for after each mutation, the asm is compiled to binary (using a ad hoc assembler, of course), tested for correctness (by computing a few hashes), and if correct, it is benchmarked.
All the above are rather simple to implement on the human side, but make for a more interesting challenge on the monkey side… and the monkeys are still busy as I write this.
After a few hours they have found a better solution: the number of memory accesses is further reduced from 144 to 132 (human MMX version had more than 180), and execution time is down by about 10%. I will let the optimizer run for a couple extra days.
The optimized Pascal code becomes quite spaghetti with heavy interlacing, and the asm is even more complicated. Below are the first few lines as illustration:
C1 := A xor A xor A xor A xor A;
C4 := A xor A xor A xor A xor A;
D := RotL1(C1) xor C4;
C2 := A xor A xor A xor A xor A;
B := RotL(A xor D1, 36);
D := RotL1(C4) xor C2;
B := RotL(A xor D4, 21);
C0 := A xor A xor A xor A xor A;
B := RotL(A xor D1, 18);
C3 := A xor A xor A xor A xor A;
B := RotL(A xor D1, 3);
D := RotL1(C2) xor C0;
Some interesting patterns emerge in all optimization runs, like the C0 computation always being moved down. None of the optimization runs kept it on top, while C1 computation always stays on top.
Monkeys have ways that are for them to know and for us to find out.