It occurred to me that SHA-3 being a cryptographic hash, it is one of those peculiar bits of code that are fully self-testing. Any bug in a cryptographic hash will quickly cascade to a different result, no matter the bug or the input.

This means the ad-hoc-compiler-monkey can be unleashed “safely”, and can be allowed to try “improper changes.”

The optimizations in the previous ad hoc SHA-3 compiler were rather tame (and cute), they only tried “safe” changes, that would not alter correct code behavior. Of course that severely restricted what it could do, leading to “an optimum given the current compiler limitations” as noted at the end of the article.

#### Rampaging Optimizer

The new optimizer is more like a rampaging monkey, it is allowed to try:

**any**permutation in the Pascal source code**any**permutation in the compiled asm code

Yes, even permutations that break correct behavior.

And it also gained a bit of creativity: when it encounters an instruction that involves a memory read access, it is allowed to insert a read access of that memory to a register (which it chooses at random). The greedy optimizer is then invoked to replace subsequent memory accesses by register accesses. Most of the time, this will trash another register and result in incorrect code. But that’s fine.

Previously the optimization goal was just to reduce the number of memory accesses, it was changed to reducing the number of memory accesses and execution time.

So for after each mutation, the asm is compiled to binary (using a ad hoc assembler, of course), tested for correctness (by computing a few hashes), and if correct, it is benchmarked.

All the above are rather simple to implement on the human side, but make for a more interesting challenge on the monkey side… and the monkeys are still busy as I write this.

#### Early Results

After a few hours they have found a better solution: the number of memory accesses is further reduced from 144 to 132 (human MMX version had more than 180), and execution time is down by about 10%. I will let the optimizer run for a couple extra days.

The optimized Pascal code becomes quite spaghetti with heavy interlacing, and the asm is even more complicated. Below are the first few lines as illustration:

`C1 := A[01] `

**xor** A[06] **xor** A[11] **xor** A[16] **xor** A[21];

C4 := A[04] **xor** A[09] **xor** A[14] **xor** A[19] **xor** A[24];

D[1] := RotL1(C1) **xor** C4;

C2 := A[02] **xor** A[07] **xor** A[12] **xor** A[17] **xor** A[22];

B[16] := RotL(A[05] **xor** D1, 36);

D[4] := RotL1(C4) **xor** C2;

B[03] := RotL(A[18] **xor** D4, 21);

C0 := A[00] **xor** A[05] **xor** A[10] **xor** A[15] **xor** A[20];

B[14] := RotL(A[20] **xor** D1, 18);

C3 := A[03] **xor** A[08] **xor** A[13] **xor** A[18] **xor** A[23];

B[07] := RotL(A[10] **xor** D1, 3);

D[2] := RotL1(C2) **xor** C0;

...

Some interesting patterns emerge in all optimization runs, like the C0 computation always being moved down. None of the optimization runs kept it on top, while C1 computation always stays on top.

Monkeys have ways that are for them to know and for us to find out.

It would make sense to run it on several CPUs, from both Intel and AMD, with various generations.

Are you sure that your tests do reflect the real use-case? L1/L2 caching means a lot in such computation, and some implementations do shine in a loop when every data is in L1/L2, but may be slower when the data is not in cache. I guess that some asm to avoid caching and/or enable prefetch at all would help, for huge input. I do not remember if MMX supports MOVNTI “non-temporal hint” instructions.

@Arnaud it is being run on an Xeon E3 (hosted server with CPU to spare), and I check the results on an AMD Phenom II as well. While L1 does play a role, I do not expect L2 does, as there is comparatively little work in the hash that is related to the data being hashed (a single xor pass vs 24 permutation rounds), and the permutation data fits well within L1. Most stressed portion appear to be L1 read/write and basic arithmetic units (xor, shift, and).

Actually the xor pass that brings the data being hashed into is actually in the Pascal function “XORIntoState”, the permutation kernel basically works in static memory (hash state & stack).

@Eric Fair enough.

Very interesting.

I was surprised you’re running the optimizer on the assembly as strings – from the previous blog posts, it seems to be custom-written for that specific assembler code. Wouldn’t it be better to have one that runs on opcodes, which could be pointed at any bit of assembler and run on it?

@David it is easier to work on and mutate strings rather than opcodes, where register changes f.i. are bit fiddling, while mutating asm strings is trivial to test, and ad hoc compiling is trivial as well. Yes, it’s less efficient in terms of computer time, but the whole idea here is that I have access to plenty of computer time (days… weeks if you count multiple cores separately), but limited human time (hours at best, including debugging).

Also being able to point it an any bit of assembler would be a vastly more complex undertaking, and mutating assembler in only possible and fully testable/safe in edge cases, like cryptographic hashes or math vector computation.

Thanks Eric.