DelphiTools Getting Rid of the Middleman

On this StackOverflow question ^[1] David Heffernan asked about a hack I’m using in DWScript’s UnifyAssignString.

TStringListCracker

The code of the function looks like:

procedure UnifyAssignString(const fromStr : UnicodeString; var toStr : UnicodeString);
var
   i : Integer;
   sl : TUnifierStringList;
begin
   if fromStr = '' then
      toStr := ''
   else begin
      i := Ord(fromStr[1]) and High(vCharStrings);
      sl := vCharStrings[i];
      sl.FLock.Enter;
      i := sl.AddObject(fromStr, nil);
      toStr := TStringListCracker(sl).FList[i].FString; // HACK HERE
      sl.FLock.Leave;
   end;
end;

The non-hacky variant would be to have that line just be

toStr := sl[i];

but if you use that simpler form, you could see a performance drop by up to 45% (Delphi XE and XE2 32bit) or 38% (Delphi XE2 64bit).

Ouch! Why is that?

The hacky version

The hacky version compiles to a single UStrAsg, the internal function for string assignment which takes care of the reference-counting. So it’s just:

dwsUtils.pas.487: toStr:=TStringListCracker(sl).FList[i].FString;
00511594 8B0424           mov eax,[esp]
00511597 8B532C           mov edx,[ebx+$2c]
0051159A 8B14F2           mov edx,[edx+esi*8]
0051159D E8A659EFFF       call @UStrAsg

i having been returned by AddObject, no range-check is required.

(note that TStrings.AddObject is called directly because TStrings.Add is just an indirection to AddObject)

What about the “toStr := sl[i]” version?

The property is just syntax sugar on a getter method, so what is actually compiled is “toStr := sl.Get(i)“, which itself, being a virtual call, looks like

toStr := sl.VMT[ TStrings_Get ]( sl, i );

However the virtual call to TStringList.Get on its own costs only a tiny bitsy of extra CPU cycles, though it also prevents inlining (at least, until the compiler gets de-virtualisation optimizations).

But there are other factors at work, the virtual call on its own can’t come close to explaining the performance differential, as after all, the rest of the code UnifyAsstringString is far from trivial (with a critical section and a binary search…)

First, it’s a function returning a string, which is a managed type, so it’s actually compiled to something like:

try
   temp := sl.getValue( i );
   UStrAsg( toStr, temp );
finally
   UStrClr( temp );
end;

Second, TStringList.GetValue itself is compiled to

TStringList.Get:
00448204 53               push ebx
00448205 56               push esi
00448206 57               push edi
00448207 8BF9             mov edi,ecx
00448209 8BF2             mov esi,edx
0044820B 8BD8             mov ebx,eax
0044820D 3B7330           cmp esi,[ebx+$30]
00448210 720F             jb $00448221
00448212 8B1544EA5100     mov edx,[$0051ea44]
00448218 8BCE             mov ecx,esi
0044821A 8BC3             mov eax,ebx
0044821C E8C3E3FFFF       call TStrings.Error
00448221 8BC7             mov eax,edi
00448223 8B532C           mov edx,[ebx+$2c]
00448226 8B14F2           mov edx,[edx+esi*8]
00448229 E81AEDFBFF       call @UStrAsg
0044822E 5F               pop edi
0044822F 5E               pop esi
00448230 5B               pop ebx
00448231 C3               ret

That’s quite a lot of code, a fair share of it is spent juggling registers because TStringList.Get is implemented as

function TStringList.Get(Index: Integer): string;
begin
   if Cardinal(Index) >= Cardinal(FCount) then
      Error(@SListIndexError, Index);
   Result := FList^[Index].FString;
end;

If the range check fails, Error triggers an exception and the last line won’t ever be reached, but the compiler can’t know that, so it compiles everything assuming the last line can be reached even after an Error, thus has to save the parameters, hence the stack juggling.

(the fix to the juggling? use an “else”)

Apart from that issue, there remains in TStringList.Get a range check and an extra UStrAsg call.

And UStrAsg itself includes an atomic incrementation instruction (bus locking)

lock inc dword ptr [edx-$08]

Even though modern CPUs handle them far more efficiently than their ancestors, atomic instructions still cost quite a bit more than their non-atomic counterparts, and the more multi-threaded you are, the more expensive they get.

Summary

So when all is said and done, the hacky version has:

1 atomic instruction (1 UStrAsg)
1 call

While the non-hacky version has

3 atomic instruction (2 UStrAsg, 1 UStrClr)
4 calls (2 UStrAsg, 1 UStrClr, 1 TStringList.Get)
1 exception frame (free in 64bit, not free in 32bit)
higher register pressure in UnifyAssignString
1 range check
a lot of register juggling (especially in 32bit)

So while “cracking” TStringList isn’t safe, depending on the circumstances, you can use it to achieve a boost for very little extra complexity.