A Look at Improved Inlining in Delphi XE6

First noticed by dewresearch [1], Delphi XE6 introduced a new optimization for inlined functions that return a floating-point value.

Here is an exploration of what was improved… and what was not improved.

When inlining was introduced in Delphi, one limitation was that functions returning a floating-point value would incur an unnecessary round-trip to the stack. For short, simple math functions, this could not just negate the benefits of inlining but make performance worse.

With XE6, that round-trip seems to have been optimized away in some cases.

A look at the most trivial case

Here are test cases for the usual conventions for functions returning a floating point value:

function GetFloat : Double;
begin
   Result := 0;
end;

function GetFloatInline : Double; inline;
begin
   Result := 0;
end;

procedure GetFloatVar(var Result : Double);
begin
   Result := 0;
end;

procedure GetFloatVarInline(var Result : Double); inline;
begin
   Result := 0;
end;

The procedure variants traditionally offered higher performance than the function variants, by eliminating all round-trips to the stack.
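To check that claim on your own compiler version, here is a minimal timing sketch. It assumes TStopwatch from System.Diagnostics (on Delphi XE the unit is simply Diagnostics), and the program name, RunTimings and cIterations are hypothetical names picked for illustration. Treat the numbers as indicative only, since loop overhead dominates for functions this trivial.

```pascal
program InlineFloatTiming;

{$APPTYPE CONSOLE}

uses
   System.Diagnostics; // on Delphi XE, use Diagnostics instead

const
   cIterations = 100000000; // hypothetical loop count, tune for your machine

function GetFloat : Double;
begin
   Result := 0;
end;

function GetFloatInline : Double; inline;
begin
   Result := 0;
end;

procedure GetFloatVar(var Result : Double);
begin
   Result := 0;
end;

procedure GetFloatVarInline(var Result : Double); inline;
begin
   Result := 0;
end;

// Locals are used on purpose, so that f lives on the stack
// as in the disassembly listings of this article.
procedure RunTimings;
var
   f : Double;
   i : Integer;
   sw : TStopwatch;
begin
   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do
      f := GetFloat;
   Writeln('GetFloat          : ', sw.ElapsedMilliseconds, ' ms');

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do
      f := GetFloatInline;
   Writeln('GetFloatInline    : ', sw.ElapsedMilliseconds, ' ms');

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do
      GetFloatVar(f);
   Writeln('GetFloatVar       : ', sw.ElapsedMilliseconds, ' ms');

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do
      GetFloatVarInline(f);
   Writeln('GetFloatVarInline : ', sw.ElapsedMilliseconds, ' ms');

   // Print f so the assignments above are actually used
   Writeln('f = ', f:0:1);
end;

begin
   RunTimings;
end.
```

Build it in Release with optimization enabled, and run it several times before drawing conclusions.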

Here is the Delphi XE6 compiler output for calls to those functions. The inlined variants are nice and tight:

Unit1.pas.48: f := GetFloat;
005D7357 E8DCFFFFFF       call GetFloat
005D735C DD1C24           fstp qword ptr [esp]
005D735F 9B               wait 
Unit1.pas.49: f := GetFloatInline;
005D7360 33C0             xor eax,eax
005D7362 890424           mov [esp],eax
005D7365 89442404         mov [esp+$04],eax
Unit1.pas.50: GetFloatVar(f);
005D7369 8BC4             mov eax,esp
005D736B E8DCFFFFFF       call GetFloatVar
Unit1.pas.51: GetFloatVarInline(f);
005D7370 33C0             xor eax,eax
005D7372 890424           mov [esp],eax
005D7375 89442404         mov [esp+$04],eax

By comparison, here is the Delphi XE compiler output for the GetFloatInline call. The output is unchanged for the other calls.

Unit1.pas.49: f := GetFloatInline;
004AB664 33C0             xor eax,eax
004AB666 89442408         mov [esp+$08],eax
004AB66A 8944240C         mov [esp+$0c],eax
004AB66E 8B442408         mov eax,[esp+$08]   // stack juggling
004AB672 890424           mov [esp],eax       // stack juggling 
004AB675 8B44240C         mov eax,[esp+$0c]   // stack juggling
004AB679 89442404         mov [esp+$04],eax   // stack juggling

And that’s just for the call (you have other induced overhead in the pre-amble and post-amble), and just for a trivial function returning a constant.

So the Delphi XE6 compiler demonstrates a clear advantage in this case.

What about the non-inlined functions?

Well, nothing changed, and the procedure variant still has the edge. A function returning a float still exhibits the stack round-trip in XE6, in the same way as in Delphi XE:

Unit1.pas.24: function GetFloat : Double;
Unit1.pas.25: begin
005D7338 83C4F8           add esp,-$08
Unit1.pas.26: Result := 0;
005D733B 33C0             xor eax,eax
005D733D 890424           mov [esp],eax
005D7340 89442404         mov [esp+$04],eax
Unit1.pas.27: end;
005D7344 DD0424           fld qword ptr [esp]
005D7347 59               pop ecx
005D7348 5A               pop edx
005D7349 C3               ret 

Unit1.pas.34: procedure GetFloatVar(var Result : Double);
Unit1.pas.35: begin
Unit1.pas.36: Result := 0;
005D734C 33D2             xor edx,edx
005D734E 8910             mov [eax],edx
005D7350 895004           mov [eax+$04],edx
Unit1.pas.37: end;
005D7353 C3               ret 

A look at a marginally more complex case

What happens when the function is slightly more complex?

function Add(const a, b : Double) : Double;
   Result := a+b;

Well, the non-inlined form still compiles rather inefficiently in both XE and XE6:

Unit1.pas.45: begin
005D7354 55               push ebp
005D7355 8BEC             mov ebp,esp
005D7357 83C4F0           add esp,-$10
Unit1.pas.46: Result := a+b;
005D735A DD4510           fld qword ptr [ebp+$10]
005D735D DC4508           fadd qword ptr [ebp+$08]
005D7360 DD5DF0           fstp qword ptr [ebp-$10]
005D7363 9B               wait 
005D7364 DD45F0           fld qword ptr [ebp-$10]
005D7367 DD5DF8           fstp qword ptr [ebp-$08]
005D736A 9B               wait 
Unit1.pas.47: end;
005D736B DD45F8           fld qword ptr [ebp-$08]
005D736E 8BE5             mov esp,ebp
005D7370 5D               pop ebp
005D7371 C21000           ret $0010

For reference, the optimal form would involve just three instructions: the load and the add below, followed by a ret.

fld a
fadd b
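One way to get close to that form today is to hand-write the function with the built-in assembler. The sketch below is an illustration rather than a recommendation: AddAsm is a hypothetical name, the asm body is 32-bit x86 only (hence the CPU386 guard with a plain Pascal fallback), and the compiler still adds its own entry code and the final ret around the asm block, so the result may not be exactly three instructions.

```pascal
program AddAsmSketch;

{$APPTYPE CONSOLE}

// Reference version, as in the article
function Add(const a, b : Double) : Double;
begin
   Result := a + b;
end;

{$IFDEF CPU386}
// 32-bit x86 only: a Double result is returned in st(0),
// so nothing has to be stored back to the stack.
function AddAsm(const a, b : Double) : Double;
asm
   fld  a   // load a onto the FPU stack
   fadd b   // st(0) := a + b, left there as the function result
end;
{$ELSE}
// Fallback for other targets, where the x87 asm above does not apply
function AddAsm(const a, b : Double) : Double;
begin
   Result := a + b;
end;
{$ENDIF}

begin
   // 1.5, 2.25 and their sum are exactly representable,
   // so a direct comparison is acceptable here
   if AddAsm(1.5, 2.25) = Add(1.5, 2.25) then
      Writeln('AddAsm matches Add')
   else
      Writeln('AddAsm differs from Add');
end.
```

Keep in mind that routines written in asm cannot be inlined, so this only helps when the call overhead itself is acceptable.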

What about inlining? Well, things changed, but not all for the best…

Here is the inefficient inlining in Delphi XE:

Unit1.pas.53: f := Add(a, b);
004AB65B DD442408         fld qword ptr [esp+$08]
004AB65F DC442410         fadd qword ptr [esp+$10]
004AB663 DD5C2418         fstp qword ptr [esp+$18]
004AB667 9B               wait 
004AB668 8B442418         mov eax,[esp+$18]
004AB66C 890424           mov [esp],eax
004AB66F 8B44241C         mov eax,[esp+$1c]
004AB673 89442404         mov [esp+$04],eax

And here is the inefficient inlining in Delphi XE6:

Unit1.pas.53: f := Add(a, b);
005D7357 DD442408         fld qword ptr [esp+$08]
005D735B DC442410         fadd qword ptr [esp+$10]
005D735F DD5C2418         fstp qword ptr [esp+$18]
005D7363 9B               wait 
005D7364 DD442418         fld qword ptr [esp+$18]
005D7368 DD1C24           fstp qword ptr [esp]
005D736B 9B               wait

So the stack juggling is still there, except that instead of being handled by integer instructions, it’s now handled by FPU instructions, along with a pointless wait instruction.

If your code is already bottlenecked by the FPU, this just won’t help…


The new function inlining in XE6 can provide some improvements over XE, but it can also result in less efficient code in a floating-point heavy context.

It also means that the need for using procedure-with-var-for-result instead of functions has – alas – not been eliminated by XE6, and there may be just as many cases in which performance goes up as cases in which performance will go down.
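For the Add case, the procedure-with-var-for-result pattern could look like the sketch below. AddVar and Demo are hypothetical names, and this article did not measure this particular variant, so check the CPU view of your own compiler version to see what it actually generates.

```pascal
program AddVarSketch;

{$APPTYPE CONSOLE}

// Function form, as discussed in the article
function Add(const a, b : Double) : Double; inline;
begin
   Result := a + b;
end;

// Procedure form: the caller supplies the destination,
// following the same convention as GetFloatVarInline
procedure AddVar(const a, b : Double; var Result : Double); inline;
begin
   Result := a + b;
end;

procedure Demo;
var
   a, b, f1, f2 : Double;
begin
   a := 1.5;
   b := 2.25;

   f1 := Add(a, b);     // function call site
   AddVar(a, b, f2);    // procedure call site, result written into f2

   if f1 = f2 then
      Writeln('same result')
   else
      Writeln('different result');
end;

begin
   Demo;
end.
```

The call site is a little less readable than with a function, which is the price of the workaround.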