Optimistic Unicode case-insensitive CompareText

In Unicode Delphi, post-Delphi 2009, there are two ways of making case-insensitive string comparisons, CompareText, which only does case-insensitivity in the ASCII range (non-accentuated characters), and the judiciously misnommed AnsiCompareText, which works on the whole Unicode range by calling into the Windows API.

Alas, AnsiCompareText is slow, very slow, as illustrated by TStringList in this post f.i., not just because Microsoft didn’t do a good job at implementing it, but also because the Unicode Consortium did its very best to scatter case-sensitive characters across the whole set… In addition you have special locale-based collation rules to deal with, which have to be applied independently from Unicode.

However, if you’re dealing with primarily latin-based strings, and looking at case-sensitive matching (rather than ordering), it’s possible to cut down significantly on the complexity of AnsiCompareText by replacing it with an “Optimistic UnicodeCompareText“, which will use an ASCII-based strategy like CompareText does, and fallback to the Windows API function when, and only when a non-ASCII character is encountered.

Of course, it won’t help with Cyrillic or Greek, but it will work well for most western-languages, where accentuated and special characters are infrequent (such as german or french).

You can find such a UnicodeCompareText function in DWScript‘s dwsUtils.pas unit, it’s being used to add support for Unicode identifiers in DWS, without incurring a performance penalty for all the pure ASCII identifiers.

5 thoughts on “Optimistic Unicode case-insensitive CompareText

  1. Right. A more common way to do that is to cast to Pointer before casting to PChar.

    Interesting optimization, Eric.

    Looked quickly at the code (not tested), but I think that the compiler might generate slightly better code by changing this:

    if c1 in [Ord(‘a’)..Ord(‘z’)] then
    c1:=c1+Ord(‘A’)-Ord(‘a’);

    to use a case statement..?

    .

  2. @Hallvard Vassbotn
    AFAICT a case statement generates the same code in Delphi XE.
    John O’Harrow’s version (in the RTL) uses two compares rather than a range check, I haven’t tested if that would behave better (John’s version also checks two characters at a time).

  3. @Eric
    Thanks. Should have worked that out, and probably would have done if you had used a pointer cast! And it is worth adding that this can only be used safely because you have established that the strings are not empty.
    @Hallvard
    It seems that the generated code (in XE) for the set test is pretty efficient: I have not looked to see what it would generate for a case statement but my guess is not much different.

Comments are closed.