A trivial way to turn a case-sensitive String hash function into a case-insensitive one is to to pass a lower-case (or upper-case) version of the String.
However, in our days of Unicode strings, this is not innocuous…
Running a quick benchmark on SimpleStringHash (a FNV-1a derivative [1]), I got the following figures (execution time, lower is better):
- SimpleStringHash( str ) : 100
- SimpleStringHash(LowerCase( str ) : 423
- SimpleStringHash(UnicodeLowerCase( str )) : 1628
LowerCase() in Delphi is not Unicode-capable, just ASCII, but still involves a memory allocation, and thus ends up 4x slower.
The UnicodeLowerCase() calls involves the Windows API CharLowerBuffW, and full Unicode support (and complexity), and so ends up 16x slower.
I found that in practice, when a hash and case-sensitivity are involved, the majority of strings are actually ASCII, with the occasionnal one involving Unicode for accents, cedillas, etc.
This can be taken advantage of in an optimistic implementation: assume ASCII, and fallback if the assumption fails.
Given the difference in execution times, the optimistic implementation will be beneficial even if its assumption is incorrect 90% of the time. The ASCII lower-case logic is simple enough and can be baked in:
function SimpleLowerCaseStringHash(const s : String) : Cardinal;
function Fallback(const s : String) : Cardinal;
begin
Result := SimpleStringHash(UnicodeLowerCase(s));
end;
var
i : Integer;
c : Word;
begin
// modified FNV-1a using length as seed
Result := Length(s);
for i := 1 to Result do begin
c := Ord(s[i]);
if c > 127 then // not ASCII, need to fallback
Exit( Fallback(s) )
else if c in [Ord('A')..Ord('Z')] then begin // ASCII lower-case logic
c := c + (Ord('a')-Ord('A'));
end;
Result := (Result xor c) * 16777619;
end;
end;
In the above code, the use of a local procedure is not innocent: a function returning a string (UnicodeLowerCase) means an implicit exception frame and reference counting overhead. By using a local procedure, we ensure that this overhead is not present when hashing an ASCII string. For more details, you may want to read this article [2].
SimpleLowerCaseStringHash executes the benchmarks with the score of 170 when the strings are ASCII, so 2.5x faster than even SimpleStringHash(LowerCase()), and more multi-thread-friendly as it involves no memory allocations.