The limitations of Delphi’s “inline”

Sometimes, the simplest-looking code can cause the Delphi compiler to stumble.

I bumped into such a case recently, and simplified it to a bare-bones version that still exhibits the issue:

type
   TFloatRec = record
      private
         Field : Double;
      public
         function RecGet : Double; inline;
   end;

   TMyClass = class
      private
         FRec : TFloatRec;
      public
         function Get : Double; virtual;
   end;

function TFloatRec.RecGet : Double;
begin
   Result:=Field; // here you could do a computation instead
end;

function TMyClass.Get : Double;
begin
   Result:=FRec.RecGet;
end;

Basically all you have are trivial functions that return the value of a floating-point field.

Given the above, for the TMyClass.Get method, the optimal codegen would look just like

fld qword ptr [eax+8]
ret

Simple enough, eh? Yet here is what the Delphi XE compiler generates:

Unit1.pas.326: begin
0053A794 83C4F0           add esp,-$10
Unit1.pas.327: Result:=FRec.RecGet;
0053A797 83C008           add eax,$08
0053A79A 8B10             mov edx,[eax]
0053A79C 89542408         mov [esp+$08],edx
0053A7A0 8B5004           mov edx,[eax+$04]
0053A7A3 8954240C         mov [esp+$0c],edx
0053A7A7 8B442408         mov eax,[esp+$08]
0053A7AB 890424           mov [esp],eax
0053A7AE 8B44240C         mov eax,[esp+$0c]
0053A7B2 89442404         mov [esp+$04],eax
Unit1.pas.328: end;
0053A7B6 DD0424           fld qword ptr [esp]
0053A7B9 83C410           add esp,$10
0053A7BC C3               ret

For those less fluent in asm, a direct pseudo-Pascal translation of the above would be

var
   p : PDouble;
   temp1, temp2 : Double;
begin
   p:=@FRec.Field;
   temp1:=p^;
   temp2:=temp1;
   Result:=temp2;
end;

And if TMyClass.Get is not virtual but a static method marked “inline”, you get the above with a third temporary (“temp3”) Double, i.e. it performs even worse.

These trips through temporaries aren’t innocuous: the temporaries live on the stack, and the CPU pipeline stalls while it waits for the round trips to the L1 cache to complete. In practice, a single one of those stalls can take as much time as half a dozen floating-point operations.

To get rid of the temporaries, there are two options. You can manually inline everything (both RecGet and Get); of course, that doesn’t sit too well with encapsulation, or with virtual calls for that matter.
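As a minimal sketch of what "manually inlining" means here, the program below copies the body of RecGet into TMyClass.Get by hand. The program name and the demo value 1.5 are assumptions made for illustration. It only shows the source-level change and makes no claim about the exact code a given compiler version emits.

```pascal
program ManualInlineSketch; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

type
   TFloatRec = record
      private
         Field : Double;
      public
         function RecGet : Double; inline;
   end;

   TMyClass = class
      private
         FRec : TFloatRec;
      public
         function Get : Double; virtual;
   end;

function TFloatRec.RecGet : Double;
begin
   Result:=Field;
end;

function TMyClass.Get : Double;
begin
   // Body of RecGet copied by hand instead of calling FRec.RecGet.
   // This compiles only because "private" members (unlike "strict private")
   // stay visible inside the module that declares them.
   Result:=FRec.Field;
end;

var
   obj : TMyClass;
begin
   obj:=TMyClass.Create;
   try
      obj.FRec.Field:=1.5;  // direct field access, for the demo only
      WriteLn(obj.Get:0:2); // prints 1.50
   finally
      obj.Free;
   end;
end.
```

The cost is visible right in the source: Get now depends on the record's private layout, which is the encapsulation problem mentioned above.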

Or you can use inline asm instead: a single asm instruction is enough, and even with calls between the functions, it will run circles around the Delphi compiler’s “inline” output.
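The asm route could look something like the sketch below. It assumes the 32-bit x86 compiler and the default register calling convention, where Self arrives in EAX and a Double result is returned in ST(0). The "inline" directive is dropped from RecGet because asm routines cannot be inlined. The program name and the demo value are, again, assumptions for illustration.

```pascal
program AsmGetterSketch; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

type
   TFloatRec = record
      private
         Field : Double;
      public
         function RecGet : Double; // no "inline": asm routines cannot be inlined
   end;

   TMyClass = class
      private
         FRec : TFloatRec;
      public
         function Get : Double; virtual;
   end;

// 32-bit x86 only: EAX holds the address of the record (Self),
// and the Double result is left on top of the x87 stack, in ST(0).
function TFloatRec.RecGet : Double;
asm
   fld qword ptr [eax].TFloatRec.Field
end;

function TMyClass.Get : Double;
begin
   Result:=FRec.RecGet; // a regular call to the asm routine
end;

var
   obj : TMyClass;
begin
   obj:=TMyClass.Create;
   try
      obj.FRec.Field:=1.5;  // direct field access, for the demo only
      WriteLn(obj.Get:0:2); // prints 1.50
   finally
      obj.Free;
   end;
end.
```

Writing the offset as [eax].TFloatRec.Field rather than a bare [eax] keeps the routine correct if other fields are later added in front of Field.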

3 thoughts on “The limitations of Delphi’s “inline”

  1. If you get rid of the inline, you’ll get even worse code:

    function TMyClass.Get : Double;
    begin
    add esp,-$08
    Result:=FRec.RecGet;
    add eax,$04
    call TFloatRec.RecGet
    fstp qword ptr [esp]
    wait
    fld qword ptr [esp]
    end;
    pop ecx
    pop edx
    ret

    function TFloatRec.RecGet : Double;
    begin
    add esp,-$08
    Result := Field; // here you could do a computation instead
    mov edx,[eax]
    mov [esp],edx
    mov edx,[eax+$04]
    mov [esp+$04],edx
    fld qword ptr [esp]
    end;
    pop ecx
    pop edx
    ret

    All those memory moves come from a few problems in how the Delphi compiler generates its FPU/x87 code:
    – it uses plain memory moves to copy one floating-point value to another (as if a Double were an Int64);
    – the x87 stack is kept separate from other data, with the x86 stack used as temporary storage for conversions;
    – when a function returns a floating-point value, the x86 stack is used as temporary storage in the function body, then the value is loaded from the x86 stack onto the x87 stack;
    – the x87 code generator doesn’t share the optimization features of the x86 code generator;
    – Delphi still emits “wait” no-op opcodes (BC++ has not generated those for decades)…

    That’s why most Delphi coders have relied on BASM for years when floating-point computation speed matters… or use an external optimized library written in C for fast calculation… remember how AggPas is very fast (faster than GDI+) but 4 times slower than the original Agg code in C…

    I don’t know whether the upcoming 64-bit compiler will handle floating-point types any better. AFAIR we were told something about using SSE for handling double types. Could be a good idea.
