A trivial way to turn a case-sensitive String hash function into a case-insensitive one is to to pass a lower-case (or upper-case) version of the String.
However, in our days of Unicode strings, this is not innocuous…
Running a quick benchmark on SimpleStringHash (a FNV-1a derivative), I got the following figures (execution time, lower is better):
- SimpleStringHash( str ) : 100
- SimpleStringHash(LowerCase( str ) : 423
- SimpleStringHash(UnicodeLowerCase( str )) : 1628
LowerCase() in Delphi is not Unicode-capable, just ASCII, but still involves a memory allocation, and thus ends up 4x slower.
The UnicodeLowerCase() calls involves the Windows API CharLowerBuffW, and full Unicode support (and complexity), and so ends up 16x slower.
I found that in practice, when a hash and case-sensitivity are involved, the majority of strings are actually ASCII, with the occasionnal one involving Unicode for accents, cedillas, etc.
This can be taken advantage of in an optimistic implementation: assume ASCII, and fallback if the assumption fails.
Given the difference in execution times, the optimistic implementation will be beneficial even if its assumption is incorrect 90% of the time. The ASCII lower-case logic is simple enough and can be baked in:
function SimpleLowerCaseStringHash(const s : String) : Cardinal; function Fallback(const s : String) : Cardinal; begin Result := SimpleStringHash(UnicodeLowerCase(s)); end; var i : Integer; c : Word; begin // modified FNV-1a using length as seed Result := Length(s); for i := 1 to Result do begin c := Ord(s[i]); if c > 127 then // not ASCII, need to fallback Exit( Fallback(s) ) else if c in [Ord('A')..Ord('Z')] then begin // ASCII lower-case logic c := c + (Ord('a')-Ord('A')); end; Result := (Result xor c) * 16777619; end; end;
In the above code, the use of a local procedure is not innocent: a function returning a string (UnicodeLowerCase) means an implicit exception frame and reference counting overhead. By using a local procedure, we ensure that this overhead is not present when hashing an ASCII string. For more details, you may want to read this article.
SimpleLowerCaseStringHash executes the benchmarks with the score of 170 when the strings are ASCII, so 2.5x faster than even SimpleStringHash(LowerCase()), and more multi-thread-friendly as it involves no memory allocations.