Getting Rid of the Middleman

On this StackOverflow question David Heffernan asked about a hack I’m using in DWScript’s UnifyAssignString.

TStringListCracker

The code of the function looks like:

procedure UnifyAssignString(const fromStr : UnicodeString; var toStr : UnicodeString);
var
   i : Integer;
   sl : TUnifierStringList;
begin
   if fromStr = '' then
      toStr := ''
   else begin
      i := Ord(fromStr[1]) and High(vCharStrings);
      sl := vCharStrings[i];
      sl.FLock.Enter;
      i := sl.AddObject(fromStr, nil);
      toStr := TStringListCracker(sl).FList[i].FString; // HACK HERE
      sl.FLock.Leave;
   end;
end;

The non-hacky variant would be to have that line just be

toStr := sl[i];

but if you use that simpler form, you could see a performance drop by up to 45% (Delphi XE and XE2 32bit) or 38% (Delphi XE2 64bit).

Ouch! Why is that?

The hacky version

The hacky version compiles to a single UStrAsg, the internal function for string assignment which takes care of the reference-counting. So it’s just:

dwsUtils.pas.487: toStr:=TStringListCracker(sl).FList[i].FString;
00511594 8B0424           mov eax,[esp]
00511597 8B532C           mov edx,[ebx+$2c]
0051159A 8B14F2           mov edx,[edx+esi*8]
0051159D E8A659EFFF       call @UStrAsg

i having been returned by AddObject, no range-check is required.

(note that TStrings.AddObject is called directly because TStrings.Add is just an indirection to AddObject)

What about the “toStr := sl[i]” version?

The property is just syntax sugar on a getter method, so what is actually compiled is  “toStr := sl.Get(i)“, which itself, being a virtual call, looks like

toStr := sl.VMT[ TStrings_Get ]( sl, i );

However the virtual call to TStringList.Get on its own costs only a tiny bitsy of extra CPU cycles, though it also prevents inlining (at least, until the compiler gets de-virtualisation optimizations).

But there are other factors at work, the virtual call on its own can’t come close to explaining the performance differential, as after all, the rest of the code UnifyAsstringString is far from trivial (with a critical section and a binary search…)

First, it’s a function returning a string, which is a managed type, so it’s actually compiled to something like:

try
   temp := sl.getValue( i );
   UStrAsg( toStr, temp );
finally
   UStrClr( temp );
end;

Second, TStringList.GetValue itself is compiled to

TStringList.Get:
00448204 53               push ebx
00448205 56               push esi
00448206 57               push edi
00448207 8BF9             mov edi,ecx
00448209 8BF2             mov esi,edx
0044820B 8BD8             mov ebx,eax
0044820D 3B7330           cmp esi,[ebx+$30]
00448210 720F             jb $00448221
00448212 8B1544EA5100     mov edx,[$0051ea44]
00448218 8BCE             mov ecx,esi
0044821A 8BC3             mov eax,ebx
0044821C E8C3E3FFFF       call TStrings.Error
00448221 8BC7             mov eax,edi
00448223 8B532C           mov edx,[ebx+$2c]
00448226 8B14F2           mov edx,[edx+esi*8]
00448229 E81AEDFBFF       call @UStrAsg
0044822E 5F               pop edi
0044822F 5E               pop esi
00448230 5B               pop ebx
00448231 C3               ret

That’s quite a lot of code, a fair share of it is spent juggling registers because TStringList.Get is implemented as

function TStringList.Get(Index: Integer): string;
begin
   if Cardinal(Index) >= Cardinal(FCount) then
      Error(@SListIndexError, Index);
   Result := FList^[Index].FString;
end;

If the range check fails, Error triggers an exception and the last line won’t ever be reached, but the compiler can’t know that, so it compiles everything assuming the last line can be reached even after an Error, thus has to save the parameters, hence the stack juggling.

(the fix to the juggling? use an “else”)

Apart from that issue, there remains in TStringList.Get a range check and an extra UStrAsg call.

And UStrAsg itself includes an atomic incrementation instruction (bus locking)

lock inc dword ptr [edx-$08]

Even though modern CPUs handle them far more efficiently than their ancestors, atomic instructions still cost quite a bit more than their non-atomic counterparts, and the more multi-threaded you are, the more expensive they get.

Summary

So when all is said and done, the hacky version has:

  • 1 atomic instruction (1 UStrAsg)
  • 1 call

While the non-hacky version has

  • 3 atomic instruction (2 UStrAsg, 1 UStrClr)
  • 4 calls (2 UStrAsg, 1 UStrClr, 1 TStringList.Get)
  • 1 exception frame (free in 64bit, not free in 32bit)
  • higher register pressure in UnifyAssignString
  • 1 range check
  • a lot of register juggling (especially in 32bit)

So while “cracking” TStringList isn’t safe, depending on the circumstances, you can use it to achieve a boost for very little extra complexity.

9 thoughts on “Getting Rid of the Middleman

  1. The same remark works for TList and TObjectList. For TStringList, you have to “crack” it, but for TList and TObjectList, you can use its List: PPointerList property.
    In our mORMot framework code, we use seldom the TList.List[] array of pointers instead of TList.Items[], when you are sure the index won’t be out of range:
    for i := 0 to fSessions.Count-1 do with TAuthSession(fSessions.List[i]) do if (fIDCardinal=aSessionID) and (fUser.LogonName=aUserName) then …
    Such code compiles very well, do not make any call nor run any complex inline statement. Sometimes, the “with … do” statement will save execution time when checking for the parameter values.
    In all cases, you’ll better know a bit about assembler, and use Alt-F2 to ensure that your code is better than the initial version. Using a real profiler, as you do, is mandatory to make real speed improvements.

  2. Please remove my previous two comments, for some reason, wordpress isn’t working properly… ):

    I have looked at DWScript code and all I can say is that for the most part, a lot of hardcore optimizations are made, this post emphasises that even more.

    “toStr&nbp;:= sl.Get(i)“ <- missed a "s" from " " "secon, TStringList.GetValue itself..." I think you meant "second" P.S. I still can't believe the speed deferences of over 35% => mind blowing

  3. @A. Bouchez
    TList only has some of the issues, as TList.Get isn’t virtual, and the returned value isn’t managed. So TList/TObjectList don’t trigger exception frames nor the extra bus locks that hurt TStringList.Get so bad.

    But yes, .List is accessible without having to crack the classes and can be quite useful. Even better, it’s a pointer to an array, so rather than accessing for each item, you can store a copy in a local variable, which allows the compiler to allocate it to a register for even better performance (that’s done in a few places in DWScript).

  4. Hi Eric, can you keep and document not optimised (hacked) code in source for future porting to FreePascal, ARM compiler, etc.

  5. For the record, using older Delphi compilers, the hacked access to items in TStringList is only 10% faster then the original version (tested with D6, D7 and BDS 2006).

  6. I still can’t install DWSScript on Delphi XE 2 i don’t know either is my version broken . Did anyone encountered this problem before ?

Comments are closed.