A trivial way to turn a case-sensitive String hash function into a case-insensitive one is to to pass a lower-case (or upper-case) version of the String.
However, in our days of Unicode strings, this is not innocuous…
Running a quick benchmark on SimpleStringHash (a FNV-1a derivative [1]), I got the following figures (execution time, lower is better):
- SimpleStringHash( str ) : 100
- SimpleStringHash(LowerCase( str ) : 423
- SimpleStringHash(UnicodeLowerCase( str )) : 1628
LowerCase() in Delphi is not Unicode-capable, just ASCII, but still involves a memory allocation, and thus ends up 4x slower.
The UnicodeLowerCase() calls involves the Windows API CharLowerBuffW, and full Unicode support (and complexity), and so ends up 16x slower.
I found that in practice, when a hash and case-sensitivity are involved, the majority of strings are actually ASCII, with the occasionnal one involving Unicode for accents, cedillas, etc.
This can be taken advantage of in an optimistic implementation: assume ASCII, and fallback if the assumption fails.
Given the difference in execution times, the optimistic implementation will be beneficial even if its assumption is incorrect 90% of the time. The ASCII lower-case logic is simple enough and can be baked in:
function SimpleLowerCaseStringHash(const s : String) : Cardinal; function Fallback(const s : String) : Cardinal; begin Result := SimpleStringHash(UnicodeLowerCase(s)); end; var i : Integer; c : Word; begin // modified FNV-1a using length as seed Result := Length(s); for i := 1 to Result do begin c := Ord(s[i]); if c > 127 then // not ASCII, need to fallback Exit( Fallback(s) ) else if c in [Ord('A')..Ord('Z')] then begin // ASCII lower-case logic c := c + (Ord('a')-Ord('A')); end; Result := (Result xor c) * 16777619; end; end;
In the above code, the use of a local procedure is not innocent: a function returning a string (UnicodeLowerCase) means an implicit exception frame and reference counting overhead. By using a local procedure, we ensure that this overhead is not present when hashing an ASCII string. For more details, you may want to read this article [2].
SimpleLowerCaseStringHash executes the benchmarks with the score of 170 when the strings are ASCII, so 2.5x faster than even SimpleStringHash(LowerCase()), and more multi-thread-friendly as it involves no memory allocations.