DelphiTools Optimistic Case-Insensitive String Hash

A trivial way to turn a case-sensitive String hash function into a case-insensitive one is to to pass a lower-case (or upper-case) version of the String.

However, in our days of Unicode strings, this is not innocuous…

Running a quick benchmark on SimpleStringHash (a FNV-1a derivative ^[1]), I got the following figures (execution time, lower is better):

SimpleStringHash( str ) : 100
SimpleStringHash(LowerCase( str ) : 423
SimpleStringHash(UnicodeLowerCase( str )) : 1628

LowerCase() in Delphi is not Unicode-capable, just ASCII, but still involves a memory allocation, and thus ends up 4x slower.

The UnicodeLowerCase() calls involves the Windows API CharLowerBuffW, and full Unicode support (and complexity), and so ends up 16x slower.

I found that in practice, when a hash and case-sensitivity are involved, the majority of strings are actually ASCII, with the occasionnal one involving Unicode for accents, cedillas, etc.

This can be taken advantage of in an optimistic implementation: assume ASCII, and fallback if the assumption fails.

Given the difference in execution times, the optimistic implementation will be beneficial even if its assumption is incorrect 90% of the time. The ASCII lower-case logic is simple enough and can be baked in:

function SimpleLowerCaseStringHash(const s : String) : Cardinal;

   function Fallback(const s : String) : Cardinal;
   begin
      Result := SimpleStringHash(UnicodeLowerCase(s));
   end;

var
   i : Integer;
   c : Word;
begin
   // modified FNV-1a using length as seed
   Result := Length(s);
   for i := 1 to Result do begin
      c := Ord(s[i]);
      if c > 127 then // not ASCII, need to fallback                            
         Exit( Fallback(s) )
      else if c in [Ord('A')..Ord('Z')] then begin  // ASCII lower-case logic
         c := c + (Ord('a')-Ord('A'));
      end;
      Result := (Result xor c) * 16777619;
   end;
end;

In the above code, the use of a local procedure is not innocent: a function returning a string (UnicodeLowerCase) means an implicit exception frame and reference counting overhead. By using a local procedure, we ensure that this overhead is not present when hashing an ASCII string. For more details, you may want to read this article ^[2].

SimpleLowerCaseStringHash executes the benchmarks with the score of 170 when the strings are ASCII, so 2.5x faster than even SimpleStringHash(LowerCase()), and more multi-thread-friendly as it involves no memory allocations.