String unification

This article deals with reference-counted String theory, it doesn’t relate to String Theory, unification of gravity with other quantum forces or canadian fashion.

Instead it deals with leveraging reference-counted String data to unify them and minimize memory usage requirements.

edit: as pointed in the comments, the usual terminology for that optimization is interning, which is a misnomer derived from an old LISP function “intern” which related to internal/external and packages.

Dispersed strings

As you’re probably aware, Delphi String memory management is automatic, and this is down through reference counting and copy-on-write, hence when you have code like

s1 := 'this is my string';
s2 := s1;

then s2 will refer to the same memory storage as s1, and you’ll have “this is my string” only once in memory. This is all fine and dandy, but if now if you have instead:

s := 'this is my ';
s1 := s + 'string';
s2 := s + 'string';

then s1 and s2 each hold distinct copy of “this is my string“.

Similar cases can be common in dynamically generated strings or when strings are read  from a file or database. If you have only two such strings, this isn’t a problem, but what if you have millions of them?

Or if you have multiple columns/rows of string data where the same data can be found multiple times? XML tags? attributes, node names, etc.

There are two main downsides to having many duplicates:

  • memory consumption: mostly visible when your data no longer fits in the CPU cache (about a megabyte), or when you run out of virtual memory in 32bit modes
  • memory manager overhead: each of those strings needs to be allocated & freed separately

Unifying the fragments

Apart from log files and database tables, this is a problem that DWScript encounters: the same names keep appearing over and over again, in variable names, parameter names, method names… so in dwsUtils you’ll find a UnifyAssignString function:

procedure UnifyAssignString(const fromStr : String; var toStr : String);
function UnifiedString(const fromStr : String) : String;

What it does is maintain an internal list of known strings, and if fromStr is amongst those known strings, instead of copying it toStr, it’ll copy the known string to toStr, so when doing

s := 'this is my ';
UnifyAssignString(s + 'string', s1);
UnifyAssignString(s + 'string', s2);

there will only be one “this is my string” in memory.

Of course this has a cost  as “known strings” need to be looked up, and it can accumulate garbage strings (the “known strings”), so you need to do it only for strings that have a high likelihood of being among known strings, and/or should periodically call TidyStringsUnifier.

As usual, the exact balance between what is beneficial to be unified and what isn’t will vary, but if you have strings that are repeated often enough, you may find that the cost of unifying strings can be more than paid back by the greater CPU cache friendliness and/or reduced pressure on the memory manager.

Measuring it

What kind of unification overhead can be expected?

The string unifyer uses an array of dedicated hash tables, which provide constant-time lookups and a statistical amount of thread-friendliness.

Let’s make an artificial case:

   a : array [1..2000000] of String;

procedure TestIt;
   i : Integer;
   for i := 1 to High(a) do
      a[i] := IntToStr(i and 65535);

So we’re filling up an array of strings with numbers in the 0..65535 range, IntToStr() is a quite simple way to obtain non-unique strings, so it should make the overhead more significant than other uses in the wild (such as strings fetched from a database or a network)

With Delphi XE, the above code runs in

  • 196 ms, and uses 76 MB of RAM with ‘:=’
  • 300 ms, and uses 17 MB of RAM witgh UnifyAssignString

There is about 50% of execution overhead, though that overhead could be hidden if you’re obtaining the strings through a slow channel (such as a network or database request).

What if you then make use of those strings?

For instance let’s count the number of ‘12345’ strings in the array:

procedure UseThem;
   n, i : Integer;
   n := 0;
   for i := 1 to High(a) do
      if a[i] = '12345' then Inc(n);

With dispersed strings this executes in 18 ms, and with unified strings, this executes in 11 ms, or a 40% speed advantage. This is mostly thanks to an optimization in the internal RTL function _UStrCmp by Pierre Le Riche, which shortcuts string comparison when the string pointers are identical.

edit: as noted by Hallvard Vassbotn in the comments, the previous optimization is actually not active in the above snippet, you need to unify the ‘12345’ string for that. In practice this results in a further speedup of about 0.3 ms, so the speedup is mostly thanks to better cache usage.

However there are also gains from the lower memory usage, for instance if you instead count the number of strings that begin with a ‘1’ by changing the “if” to

if a[i][1] = '1' then Inc(n);

Execution time is now 10 ms for dispersed strings vs 7 ms for unified, or a 30% difference, which essentially comes from more of the data being able to fit in the CPU cache.

Usage scenarios

This is definitely of a special case optimization, but it can be handy when you have a lot of redundancy, are tight in terms of memory, or want to load and process whole data-sets of string content in RAM, obtained from a database or a detailed log.

This can help you to brute-force vast amounts of redundant data into memory, and thus help keep the code simple.

Another usage case is when you’re loading a significant amount of strings, which will remain static throughout the rest of an execution, but are going to be involved in a lot of CPU work.

In DWScript this is applied selectively, and is instrumental in keeping memory usage in check for some large automatically imported libraries, while helping slightly with the compile times.

8 thoughts on “String unification

  1. @mghie Yep, sounds like the same thing. The “intern” name sounds pretty bad however, or at least, I don’t see how it relates to any of the common meaning of the word… (edit: found it, it’s named after an old LISP function, in which it was related to internal/external).

    In design patterns they call that “FlyWeight”, which is an equally inexpressive name IMHO.

    I’ve heard about constant unification way back in the last century, when it was done in the compilers after constant folding. So that’s why I picked that one, but AFAICT this is a false friend that doesn’t work in English. Oh well 🙂

  2. Yes, in so many other places this is known as interning that I think that terminology has won.

  3. > shortcuts string comparison when the string pointers are identical.

    Yes, but that case does not trigger unless you intern/unify the ‘12345’ constant, too. And even if you do, it will only match in 1/65535 of the cases.

    So the real performance boost is due to your code, improving data cache reuse. In the plain code, there is no cache reuse at all.

Comments are closed.