Optimistic Case-Insensitive String Hash

clockworkA trivial way to turn a case-sensitive String hash function into a case-insensitive one is to to pass a lower-case (or upper-case) version of the String.

However, in our days of Unicode strings, this is not innocuous…

Running a quick benchmark on SimpleStringHash (a FNV-1a derivative), I got the following figures (execution time, lower is better):

  • SimpleStringHash( str ) : 100
  • SimpleStringHash(LowerCase( str ) : 423
  • SimpleStringHash(UnicodeLowerCase( str )) : 1628

LowerCase() in Delphi is not Unicode-capable, just ASCII, but still involves a memory allocation, and thus ends up 4x slower.

The UnicodeLowerCase() calls involves the Windows API CharLowerBuffW, and full Unicode support (and complexity), and so ends up 16x slower.

I found that in practice, when a hash and case-sensitivity are involved, the majority of strings are actually ASCII, with the occasionnal one involving Unicode for accents, cedillas, etc.

This can be taken advantage of in an optimistic implementation: assume ASCII, and fallback if the assumption fails.

Given the difference in execution times, the optimistic implementation will be beneficial even if its assumption is incorrect 90% of the time. The ASCII lower-case logic is simple enough and can be baked in:

function SimpleLowerCaseStringHash(const s : String) : Cardinal;

   function Fallback(const s : String) : Cardinal;
   begin
      Result := SimpleStringHash(UnicodeLowerCase(s));
   end;

var
   i : Integer;
   c : Word;
begin
   // modified FNV-1a using length as seed
   Result := Length(s);
   for i := 1 to Result do begin
      c := Ord(s[i]);
      if c > 127 then // not ASCII, need to fallback                            
         Exit( Fallback(s) )
      else if c in [Ord('A')..Ord('Z')] then begin  // ASCII lower-case logic
         c := c + (Ord('a')-Ord('A'));
      end;
      Result := (Result xor c) * 16777619;
   end;
end;

In the above code, the use of a local procedure is not innocent: a function returning a string (UnicodeLowerCase) means an implicit exception frame and reference counting overhead. By using a local procedure, we ensure that this overhead is not present when hashing an ASCII string. For more details, you may want to read this article.

SimpleLowerCaseStringHash executes the benchmarks with the score of 170 when the strings are ASCII, so 2.5x faster than even SimpleStringHash(LowerCase()), and more multi-thread-friendly as it involves no memory allocations.

2 thoughts on “Optimistic Case-Insensitive String Hash

  1. Eric,

    I think your code has a problem with the “Turkish i” :
    https://msdn.microsoft.com/en-us/library/ms994325.aspx#cltsafcode_topic4

    In short, if you are in a turkish locale, the uppercase of “i” is “İ” and the lowercase of “I” is “ı”. You need to do an unicode lowercase/uppercase for it to work well.

    One could say we could just ignore this particular case, but there is that infamous case were some people was actually killed because of the lack of support for dotless i:
    http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail

  2. @adrian if you want to support the “Turkish i” you have a broader issue: you need to specify the locale collation, which means you can use none of the Unicode functions that do not accept a locale collation, so the alternative solutions listed in the article would fail as well.

    That said, if you’re going to safeguard your code for the Turkish locale, it’s trivial to adapt the above code for it, much more trivial than adapting the rest of the code, especially if your app will not just deal exclusively with Turkish strings. For starters, you’ll need to pass a collation alongside every string, because an upper case i with dot is incorrect in other locales.

Comments are closed.