- DelphiTools - https://www.delphitools.info -

Long strings: Hash vs Sorted vs Unsorted

Hash [1], IndexOf [2], Lookup [3], Performance [4], Search [5], Sorted [6], String [7]long_randomlong_stringsAs a followup to the previous String Lookup: Hash, Sorted or Unsorted [8] here is a look at what happens for longer strings.

When the string you’re searching have more than a dozen characters, and you only have a few hundreds of them, a clear winner emerges.

First chart is for searching strings of length 250, where the first 128 characters are ‘a’ and the rest is random.

These similar characters at the start form a of handicap for string comparisons as 128 characters will have to be looked up before the comparisons can have a chance to be decided. This is penalizing the Unsorted and Sorted list cases, as these rely solely on comparisons.

long_strings [9]

(sorry the chart is a bit noisy, I did not run too many iterations)

Without the handicap, if the long string characters are purely random, well, the verdict is clear

long_random [10]

For long strings, the cost of computing the hash-code makes both hash maps implementations non-competitive, unless you have a bazillion of long strings, and a humble sorted TStringList (with CompareStrings method overridden to case-insenstive) is having a field day.

The overhead of computing the hash code is so significant, that up to a few dozen strings, even an unsorted TStringList will outperform TDictionary.

Interestingly enough, what is true for long string hash code computation will also be true for long objects (or records) hash code computations. So if your keys are large enough, and you do not have an awful lot if items, chances are a hash map will not be a very effective solution.

Strategies such as caching the hash code could help, but only if an hash code is going to be reused “enough times” to pay for its initial computation cost.

Of course hash maps have other advantages, such as being able to add and remove items cheaply (while it can be expensive in a sorted list), but raw search/lookup performance is not always going to be a given.