The Curiouser and Curiouser Case of Case-Insensitive Tweaks

Recent commits to the DWScript repository doubled the compiler performance when compiling many small scripts, like happens in the unit tests suites.

This started from a first profiling run where the memory allocations around the UnicodeLowerCase function came out as top bottlenecks.

Thing is, Pascal being a case-insensitive language, there are lots of case-insensitive comparisons, lookups, searches and hashes, and turns out a key hash code was computed with code like

hashCode := SimpleStringHash(UnicodeLowerCase(string))

So with a memory allocation for each hash code computation. This was quickly solved with an optimistic implementation, taking advantage of (1) most identifiers actually being ASCII and (2) most identifiers being short.

The result is a new SimpleStringLowerCaseHash function.

But benchmarking yielded an oddity: Win32 version was now much faster than Win64!

A round of profiling pointed fingers at the xxHash implementation, which benefitted from an asm implementation in Win32, but none for Win64.

Now, usually the Delphi Win64 works quite well, but xxHash relies on bitwise rotation, which are trivial in asm, but not have to be emulated with with shifts at the language level. So the fix was to bring the asm to Win64.

Another benchmark demonstrated a marked improvement, but… Win32 was still ahead! Another round of profiling pointed to a specific line, which looked innocuous enough:

if c in [ Ord('A')..Ord('Z') ] then

Looking at what it’s compiled to, in Win32 we have

mov ecx, edx
add ecx, -$41
sub ecx, $1a
jnb $0124e097

which is good, it converts the range comparison to a subtraction (aptly disguised as an addition) and an unsigned comparison to zero (aptly disguised as as a subtraction).

But on the Win64 side, it’s compiled to something a little bit more arcane

mov   ecx, edx
sub   ecx, $40
cmp   ecx, $1f
jnbe  SimpleStringLowerCaseHash + $9D
mov   edi, $00000001
shl   edi, cl
mov   ecx, [rel $00000046]
test  edi, ecx
setnz cl
jmp   SimpleStringLowerCaseHash + $9F
xor   ecx, ecx
test  cl, cl
jz    SimpleStringLowerCaseHash + $AE

even without understanding what’s going, it’s obvious the compiler is doing something more complicated… The first lines (sub & cmp) are doing the same thing the Win32 compiler was doing, which is convert the range comparison to a subtraction (the sub) and an unsigned comparison (the cmp).

That’s when thing take a dark turn, because instead of just proceeding like the Win32 compiler did, the Win64 compiler is creating a temporary boolean, adjusting the  and then jumping on it. It’s a bit like doing:

var bool : Boolean;;
if (c in [ Ord('A')..Ord('Z') ])  = True then
   bool := True
else bool := False;
if bool then ...

No wonder it’s slower: we’re in the inner loop here, and doing that for all characters in a string.
The workaround is to make it simpler for the compiler:

// if Cardinal(c - Ord('A')) <= Cardinal(Ord('Z') - Ord('A')) then
mov  esi, eax
sub  esi, $41
cmp  esi, $19
jnbe SimpleStringLowerCaseHash + $8C

and now the Win64 compiler generates similar code as Win32, just without aptly disguising the subtraction and comparison.

 

6 thoughts on “The Curiouser and Curiouser Case of Case-Insensitive Tweaks

  1. If you want to improve the performance of SimpleStringLowerCaseHash even further you might want to look at 2 Chars at once and do the uppercasing without a branch. For some inspiration, you might want to take a look at GetHashCode_UString_OrdinalCaseInsensitive in Spring.Comparers.pas

  2. Thanks. Interesting approach, the “or $20” adds collisions a few non-letter characters, but those wouldn’t matter for identifiers (like @ with backtick, or _ with DELETE).

  3. @Stefan committed an update with SimpleStringCaseInsensitiveHash, it’s about twice as fast as the older one, unfortunately that’s not noticeable in global benchmark as its wasn’t anywhere near a bottleneck anymore

    @Arnaud, sounds like the branchless version is something that could be adapted to a CompareText, which is still a bottleneck (as a comparison is required in the hashtable to confirm a hit)

  4. @Eric
    You can make a branchless uppercase conversion + hash compute with this pattern, one NativeInt at a time.

  5. @Arnaud, but for hash the approximated ‘or $20’ lowercase approach is even faster, and the approximation is fine for identifier strings (if not for general string use). For CompareText, an exact version like yours is required.
    I’ll run some benchmarks, because statistically, case matches more often than not when compiling, and the overhead of going branchless may be detrimental vs a branching comparison that is well predicted.

Leave a Reply

Your email address will not be published. Required fields are marked *