- DelphiTools - https://www.delphitools.info -

A Look at Improved Inlining in Delphi XE6

First noticed by dewresearch [1], Delphi XE6 introduced a new optimization for inlined functions that return a floating-point value.

Here is an exploration of what was improved… and what was not improved.

When inlining was introduced in Delphi, one limitation was that functions returning a floating-point value would incur an unnecessary round-trip to the stack. For short, simple math functions, this could not only negate the benefits of inlining but actually make performance worse.

With XE6, that round-trip seems to have been optimized away in some cases.

A look at the most trivial case

Here are test cases for the usual conventions for functions returning a floating point value:

function GetFloat : Double;
begin
   Result := 0;
end;

function GetFloatInline : Double; inline;
begin
   Result := 0;
end;

procedure GetFloatVar(var Result : Double);
begin
   Result := 0;
end;

procedure GetFloatVarInline(var Result : Double); inline;
begin
   Result := 0;
end;

The procedure variants traditionally offered higher performance than the function variants, by eliminating all round-trips to the stack.
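To check that claim on a given compiler version, the four variants can be timed side by side. The following is only a minimal sketch: the program name, the iteration count and the use of TStopwatch are assumptions and not part of the original test, and the loop overhead will dominate for functions this trivial, so the CPU view remains the more reliable tool.

```pascal
program FloatCallBench;

{$APPTYPE CONSOLE}

uses
   System.Diagnostics; // plain "Diagnostics" in versions without unit scope names, such as XE

function GetFloat : Double;
begin
   Result := 0;
end;

function GetFloatInline : Double; inline;
begin
   Result := 0;
end;

procedure GetFloatVar(var Result : Double);
begin
   Result := 0;
end;

procedure GetFloatVarInline(var Result : Double); inline;
begin
   Result := 0;
end;

const
   cIterations = 100000000; // hypothetical loop count, adjust to taste

var
   sw : TStopwatch;
   i : Integer;
   f, sum : Double;
begin
   sum := 0; // accumulated and printed so that the results are actually used

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do begin
      f := GetFloat;
      sum := sum + f;
   end;
   Writeln('GetFloat          : ', sw.ElapsedMilliseconds, ' ms');

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do begin
      f := GetFloatInline;
      sum := sum + f;
   end;
   Writeln('GetFloatInline    : ', sw.ElapsedMilliseconds, ' ms');

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do begin
      GetFloatVar(f);
      sum := sum + f;
   end;
   Writeln('GetFloatVar       : ', sw.ElapsedMilliseconds, ' ms');

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do begin
      GetFloatVarInline(f);
      sum := sum + f;
   end;
   Writeln('GetFloatVarInline : ', sw.ElapsedMilliseconds, ' ms');

   Writeln('checksum: ', sum:0:1);
end.
```

Build it as a release configuration for the Win32 target, since the disassembly discussed here is 32-bit x87 code.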

Here is the Delphi XE6 compiler output for calls to those functions; the inlined variants are nice and tight:

Unit1.pas.48: f := GetFloat;
005D7357 E8DCFFFFFF       call GetFloat
005D735C DD1C24           fstp qword ptr [esp]
005D735F 9B               wait 
Unit1.pas.49: f := GetFloatInline;
005D7360 33C0             xor eax,eax
005D7362 890424           mov [esp],eax
005D7365 89442404         mov [esp+$04],eax
Unit1.pas.50: GetFloatVar(f);
005D7369 8BC4             mov eax,esp
005D736B E8DCFFFFFF       call GetFloatVar
Unit1.pas.51: GetFloatVarInline(f);
005D7370 33C0             xor eax,eax
005D7372 890424           mov [esp],eax
005D7375 89442404         mov [esp+$04],eax

By comparison, here is the Delphi XE compiler output for the GetFloatInline call. The output is unchanged for the other calls.

Unit1.pas.49: f := GetFloatInline;
004AB664 33C0             xor eax,eax
004AB666 89442408         mov [esp+$08],eax
004AB66A 8944240C         mov [esp+$0c],eax
004AB66E 8B442408         mov eax,[esp+$08]   // stack juggling
004AB672 890424           mov [esp],eax       // stack juggling 
004AB675 8B44240C         mov eax,[esp+$0c]   // stack juggling
004AB679 89442404         mov [esp+$04],eax   // stack juggling

And that’s just for the call (there is additional induced overhead in the preamble and postamble), and just for a trivial function returning a constant.

So the Delphi XE6 compiler demonstrates a clear advantage.

What about the non-inlined functions?

Well, nothing changed, and the procedure variant still has the edge: a function returning a float still exhibits the stack round-trip in XE6, in the same way as in Delphi XE:

Unit1.pas.24: function GetFloat : Double;
Unit1.pas.25: begin
005D7338 83C4F8           add esp,-$08
Unit1.pas.26: Result := 0;
005D733B 33C0             xor eax,eax
005D733D 890424           mov [esp],eax
005D7340 89442404         mov [esp+$04],eax
Unit1.pas.27: end;
005D7344 DD0424           fld qword ptr [esp]
005D7347 59               pop ecx
005D7348 5A               pop edx
005D7349 C3               ret 

Unit1.pas.34: procedure GetFloatVar(var Result : Double);
Unit1.pas.35: begin
Unit1.pas.36: Result := 0;
005D734C 33D2             xor edx,edx
005D734E 8910             mov [eax],edx
005D7350 895004           mov [eax+$04],edx
Unit1.pas.37: end;
005D7353 C3               ret 

A look at a marginally more complex case

What happens when the function is slightly more complex?

function Add(const a, b : Double) : Double;
begin
   Result := a+b;
end;

Well, the non-inlined form still compiles rather inefficiently in both XE and XE6:

Unit1.pas.45: begin
005D7354 55               push ebp
005D7355 8BEC             mov ebp,esp
005D7357 83C4F0           add esp,-$10
Unit1.pas.46: Result := a+b;
005D735A DD4510           fld qword ptr [ebp+$10]
005D735D DC4508           fadd qword ptr [ebp+$08]
005D7360 DD5DF0           fstp qword ptr [ebp-$10]
005D7363 9B               wait 
005D7364 DD45F0           fld qword ptr [ebp-$10]
005D7367 DD5DF8           fstp qword ptr [ebp-$08]
005D736A 9B               wait 
Unit1.pas.47: end;
005D736B DD45F8           fld qword ptr [ebp-$08]
005D736E 8BE5             mov esp,ebp
005D7370 5D               pop ebp
005D7371 C21000           ret $0010

For reference, the optimal form would involve just three instructions:

fld a
fadd b
ret
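A hand-written version gets close to that form. The sketch below assumes the 32-bit x87 target; AddAsm and the program name are hypothetical. The compiler is expected to supply the stack frame and the final ret itself, so the generated code may not be exactly those three instructions, and routines written in assembler are not candidates for inlining.

```pascal
program AddAsmSketch;

{$APPTYPE CONSOLE}

{$IFNDEF CPU386}
   {$MESSAGE FATAL 'x87 sketch: 32-bit x86 target only'}
{$ENDIF}

// AddAsm is a hypothetical name. On Win32 both Double parameters are
// passed on the stack and the result is returned in ST(0), so the body
// only needs to load one operand and add the other.
function AddAsm(const a, b : Double) : Double;
asm
   fld   a   // push a on the FPU stack
   fadd  b   // ST(0) := a + b, left there as the function result
end;

var
   f : Double;
begin
   f := AddAsm(1.5, 2.25);
   Writeln(f:0:2);
end.
```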

What about inlining? Well, things changed, but not all for the best…

Here is the inefficient inlining in Delphi XE:

Unit1.pas.53: f := Add(a, b);
004AB65B DD442408         fld qword ptr [esp+$08]
004AB65F DC442410         fadd qword ptr [esp+$10]
004AB663 DD5C2418         fstp qword ptr [esp+$18]
004AB667 9B               wait 
004AB668 8B442418         mov eax,[esp+$18]
004AB66C 890424           mov [esp],eax
004AB66F 8B44241C         mov eax,[esp+$1c]
004AB673 89442404         mov [esp+$04],eax

And here is the inefficient inlining in Delphi XE6:

Unit1.pas.53: f := Add(a, b);
005D7357 DD442408         fld qword ptr [esp+$08]
005D735B DC442410         fadd qword ptr [esp+$10]
005D735F DD5C2418         fstp qword ptr [esp+$18]
005D7363 9B               wait 
005D7364 DD442418         fld qword ptr [esp+$18]
005D7368 DD1C24           fstp qword ptr [esp]
005D736B 9B               wait

So the stack juggling is still there, except that instead of being handled by integer instructions, it’s now handled by FPU instructions, along with a pointless wait instruction.

If your code is already bottlenecked by the FPU, this just won’t help…

Conclusion

The new function inlining in XE6 can provide some improvements over XE, but it can also result in less efficient code in a floating-point heavy context.

It also means that the need to use procedure-with-var-for-result instead of functions has, alas, not been eliminated by XE6, and there may be just as many cases in which performance goes up as cases in which it goes down.
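For illustration, here is the Add example rewritten in the procedure-with-var-for-result style. AddVar and the program name are hypothetical, and this variant was not part of the measurements above, so whether it compiles to tighter code on a given compiler version should be verified in the CPU view.

```pascal
program AddVarSketch;

{$APPTYPE CONSOLE}

// AddVar is a hypothetical name: the Add example with the result
// written through a var parameter instead of the function result
procedure AddVar(const a, b : Double; var Result : Double); inline;
begin
   Result := a + b;
end;

var
   a, b, f : Double;
begin
   a := 1.5;
   b := 2.25;
   AddVar(a, b, f); // f receives a + b, no function result involved
   Writeln(f:0:2);
end.
```

The cost is at the call site: the result can no longer be used directly inside an expression, so every call needs a destination variable.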