Sometimes, the simplest-looking code can cause the Delphi compiler to stumble.
I bumped into such a case recently, and simplified it to a bare-bones version that still exhibits the issue:
type
   TFloatRec = record
      private
         Field : Double;
      public
         function RecGet : Double; inline;
   end;

   TMyClass = class
      private
         FRec : TFloatRec;
      public
         function Get : Double; virtual;
   end;

function TFloatRec.RecGet : Double;
begin
   Result:=Field; // here you could do a computation instead
end;

function TMyClass.Get : Double;
begin
   Result:=FRec.RecGet;
end;
Basically all you have are trivial functions that return the value of a floating-point field.
Given the above, for the TMyClass.Get method, the optimal codegen would look just like
fld qword ptr [eax+8]
ret
Simple enough, eh? Yet here is what the Delphi XE compiler generates:
Unit1.pas.326: begin
0053A794 83C4F0           add esp,-$10
Unit1.pas.327: Result:=FRec.Get;
0053A797 83C008           add eax,$08
0053A79A 8B10             mov edx,[eax]
0053A79C 89542408         mov [esp+$08],edx
0053A7A0 8B5004           mov edx,[eax+$04]
0053A7A3 8954240C         mov [esp+$0c],edx
0053A7A7 8B442408         mov eax,[esp+$08]
0053A7AB 890424           mov [esp],eax
0053A7AE 8B44240C         mov eax,[esp+$0c]
0053A7B2 89442404         mov [esp+$04],eax
Unit1.pas.328: end;
0053A7B6 DD0424           fld qword ptr [esp]
0053A7B9 83C410           add esp,$10
0053A7BC C3               ret
For those less fluent in asm, a direct pseudo-Pascal translation of the above would be:
var
   p : PDouble;
   temp1, temp2 : Double;
begin
   p:=@FRec.Field;
   temp1:=p^;
   temp2:=temp1;
   Result:=temp2;
end;
And if TMyClass.Get is not virtual, but a static method with “inline”, you get the above with a third “temp3” Double (i.e. it will perform even worse).
Those trips through temporaries aren’t innocuous, because the temporaries live on the stack, and result in stalls as the CPU pipeline waits for the roundtrips to the L1 memory cache to happen. In practice, a single one of those stalls can take as much time as half a dozen floating-point operations.
To get rid of the temporaries, there are two options. You can manually inline everything (both RecGet and Get). Of course, that doesn’t sit too well with encapsulation, or with virtual calls for that matter.
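As a minimal sketch of the manual-inlining option, assuming both types are declared in the same unit (so the record's private field is reachable from the class), Get can read the field directly instead of going through RecGet. The declarations are repeated only to keep the fragment readable on its own; what the compiler actually emits for it is not shown here and should be checked in the CPU view.

```delphi
type
   TFloatRec = record
      private
         Field : Double;
   end;

   TMyClass = class
      private
         FRec : TFloatRec;
      public
         function Get : Double; virtual;
   end;

function TMyClass.Get : Double;
begin
   // Manually inlined: no call to a record method, the field is read directly.
   // Only legal because "private" is unit-wide in Delphi (not "strict private").
   Result:=FRec.Field;
end;
```

The cost is visible right away: TMyClass now depends on the internal layout of TFloatRec, which is the encapsulation problem mentioned above.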
Or you can use inline asm instead: a single asm instruction is enough, and even with calls between the functions, it will run circles around the Delphi compiler’s “inline” output.
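A sketch of what the asm option could look like, under a few assumptions: a 32-bit x86 target with the default register calling convention (Self arrives in EAX, a Double result is returned in ST(0)), and the "inline" directive dropped from RecGet, since the compiler does not inline routines written in assembly. The program name and the test value are made up for illustration.

```delphi
program AsmGetDemo; // hypothetical name, 32-bit x86 target assumed
{$APPTYPE CONSOLE}

type
   TFloatRec = record
      private
         Field : Double;
      public
         function RecGet : Double; // no "inline": asm routines are not inlined
   end;

   TMyClass = class
      private
         FRec : TFloatRec;
      public
         function Get : Double; virtual;
   end;

function TFloatRec.RecGet : Double;
asm
   // EAX = Self (address of the record); load the field straight onto the FPU stack
   fld qword ptr [eax].TFloatRec.Field
end;

function TMyClass.Get : Double;
begin
   // a real call this time, not an inlined body
   Result:=FRec.RecGet;
end;

var
   obj : TMyClass;
begin
   obj:=TMyClass.Create;
   try
      obj.FRec.Field:=1.5; // arbitrary test value
      WriteLn(obj.Get:0:2);
   finally
      obj.Free;
   end;
end.
```

Using the `[eax].TFloatRec.Field` form rather than a hard-coded offset lets the assembler compute the field offset, so the routine keeps working if fields are later added before it. The obvious trade-off is portability: this body is specific to the 32-bit compiler.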