- DelphiTools - https://www.delphitools.info -

The limitations of Delphi’s “inline”

Sometimes, the most simple-looking code can cause the Delphi compiler to stumble.

I bumped on such a case recently, and simplified it to a bare-bones version that still exhibits the issue:

type
   TFloatRec = record
      private
         Field : Double;
      public
         function RecGet : Double; inline;
   end;

   TMyClass = class
      private
         FRec : TFloatRec;
      public
         function Get : Double; virtual;
   end;

function TFloatRec.Get : Double;
begin
   Result:=Field; // here you could do a computation instead
end;

function TMyClass.Get : Double;
begin
   Result:=FRec.RecGet;
end;

Basically all you have are trivial functions that return the value of a floating-point field.

Given the above, for the TMyClass.Get method, the optimal codegen would look just like

fld qword ptr [eax+8]
ret

Simple enough, eh? Yet here is what the Delphi XE compiler generates:

Unit1.pas.326: begin
0053A794 83C4F0           add esp,-$10
Unit1.pas.327: Result:=FRec.Get;
0053A797 83C008           add eax,$08
0053A79A 8B10             mov edx,[eax]
0053A79C 89542408         mov [esp+$08],edx
0053A7A0 8B5004           mov edx,[eax+$04]
0053A7A3 8954240C         mov [esp+$0c],edx
0053A7A7 8B442408         mov eax,[esp+$08]
0053A7AB 890424           mov [esp],eax
0053A7AE 8B44240C         mov eax,[esp+$0c]
0053A7B2 89442404         mov [esp+$04],eax
Unit1.pas.328: end;
0053A7B6 DD0424           fld qword ptr [esp]
0053A7B9 83C410           add esp,$10
0053A7BC C3               ret

for the less-asm fluent, a direct pseudo-pascal translation of the above would be

var
   p : PDouble;
   temp1, temp2 : Double;
begin
   p:[email protected];
   temp1:=p^;
   temp2:=temp1;
   Result:=temp2;
end;

And if TMyClass.Get is not virtual, but a static method with “inline”, you get the above with a third temp3” Double (ie. it will perform even worse).

The above trips to temporaries aren’t innocuous, because those temporaries are in the stack, and result in stalls as the CPU pipeline waits for the roundtrips to L1 memory cache to happen. In practice, a single of those stalls can take as much time as half a dozen floating operations.

To get rid of the temporaries, there are two options: you can manually inline everything (the RecGet & the Get) to get rid of the temporaries, of course, that doesn’t sit too well with encapsulation, or with virtual calls for that matter.

Or you can use inline-asm instead, a single instruction of asm being enough, and even with calls betweens the functions, it will be running circles around the Delphi compiler’s “inline” output.