On this StackOverflow question  David Heffernan asked about a hack I’m using in DWScript’s UnifyAssignString.
The code of the function looks like:
procedure UnifyAssignString(const fromStr : UnicodeString; var toStr : UnicodeString); var i : Integer; sl : TUnifierStringList; begin if fromStr = '' then toStr := '' else begin i := Ord(fromStr) and High(vCharStrings); sl := vCharStrings[i]; sl.FLock.Enter; i := sl.AddObject(fromStr, nil); toStr := TStringListCracker(sl).FList[i].FString; // HACK HERE sl.FLock.Leave; end; end;
The non-hacky variant would be to have that line just be
toStr := sl[i];
but if you use that simpler form, you could see a performance drop by up to 45% (Delphi XE and XE2 32bit) or 38% (Delphi XE2 64bit).
Ouch! Why is that?
The hacky version
The hacky version compiles to a single UStrAsg, the internal function for string assignment which takes care of the reference-counting. So it’s just:
dwsUtils.pas.487: toStr:=TStringListCracker(sl).FList[i].FString; 00511594 8B0424 mov eax,[esp] 00511597 8B532C mov edx,[ebx+$2c] 0051159A 8B14F2 mov edx,[edx+esi*8] 0051159D E8A659EFFF call @UStrAsg
i having been returned by AddObject, no range-check is required.
(note that TStrings.AddObject is called directly because TStrings.Add is just an indirection to AddObject)
What about the “toStr := sl[i]” version?
The property is just syntax sugar on a getter method, so what is actually compiled is “toStr := sl.Get(i)“, which itself, being a virtual call, looks like
toStr := sl.VMT[ TStrings_Get ]( sl, i );
However the virtual call to TStringList.Get on its own costs only a tiny bitsy of extra CPU cycles, though it also prevents inlining (at least, until the compiler gets de-virtualisation optimizations).
But there are other factors at work, the virtual call on its own can’t come close to explaining the performance differential, as after all, the rest of the code UnifyAsstringString is far from trivial (with a critical section and a binary search…)
First, it’s a function returning a string, which is a managed type, so it’s actually compiled to something like:
try temp := sl.getValue( i ); UStrAsg( toStr, temp ); finally UStrClr( temp ); end;
Second, TStringList.GetValue itself is compiled to
TStringList.Get: 00448204 53 push ebx 00448205 56 push esi 00448206 57 push edi 00448207 8BF9 mov edi,ecx 00448209 8BF2 mov esi,edx 0044820B 8BD8 mov ebx,eax 0044820D 3B7330 cmp esi,[ebx+$30] 00448210 720F jb $00448221 00448212 8B1544EA5100 mov edx,[$0051ea44] 00448218 8BCE mov ecx,esi 0044821A 8BC3 mov eax,ebx 0044821C E8C3E3FFFF call TStrings.Error 00448221 8BC7 mov eax,edi 00448223 8B532C mov edx,[ebx+$2c] 00448226 8B14F2 mov edx,[edx+esi*8] 00448229 E81AEDFBFF call @UStrAsg 0044822E 5F pop edi 0044822F 5E pop esi 00448230 5B pop ebx 00448231 C3 ret
That’s quite a lot of code, a fair share of it is spent juggling registers because TStringList.Get is implemented as
function TStringList.Get(Index: Integer): string; begin if Cardinal(Index) >= Cardinal(FCount) then Error(@SListIndexError, Index); Result := FList^[Index].FString; end;
If the range check fails, Error triggers an exception and the last line won’t ever be reached, but the compiler can’t know that, so it compiles everything assuming the last line can be reached even after an Error, thus has to save the parameters, hence the stack juggling.
(the fix to the juggling? use an “else”)
Apart from that issue, there remains in TStringList.Get a range check and an extra UStrAsg call.
And UStrAsg itself includes an atomic incrementation instruction (bus locking)
lock inc dword ptr [edx-$08]
Even though modern CPUs handle them far more efficiently than their ancestors, atomic instructions still cost quite a bit more than their non-atomic counterparts, and the more multi-threaded you are, the more expensive they get.
So when all is said and done, the hacky version has:
- 1 atomic instruction (1 UStrAsg)
- 1 call
While the non-hacky version has
- 3 atomic instruction (2 UStrAsg, 1 UStrClr)
- 4 calls (2 UStrAsg, 1 UStrClr, 1 TStringList.Get)
- 1 exception frame (free in 64bit, not free in 32bit)
- higher register pressure in UnifyAssignString
- 1 range check
- a lot of register juggling (especially in 32bit)
So while “cracking” TStringList isn’t safe, depending on the circumstances, you can use it to achieve a boost for very little extra complexity.