As first noticed by dewresearch [1], Delphi XE6 introduced a new optimization for inlined functions that return a floating-point value.
Here is an exploration of what was improved… and what was not improved.
When inlining was introduced in Delphi, one limitation was that functions returning a floating-point value would incur an unnecessary round-trip to the stack. For short, simple math functions, this could not just negate the benefits of inlining but actually make performance worse.
With XE6, that round-trip seems to have been optimized away in some cases.
A look at the most trivial case
Here are test cases for the usual conventions for functions returning a floating point value:
function GetFloat : Double;
begin
   Result := 0;
end;

function GetFloatInline : Double; inline;
begin
   Result := 0;
end;

procedure GetFloatVar(var Result : Double);
begin
   Result := 0;
end;

procedure GetFloatVarInline(var Result : Double); inline;
begin
   Result := 0;
end;
The procedure variants traditionally offered higher performance than function, by eliminating all round-trips to the stack.
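For anyone wanting to reproduce the listings, here is a minimal sketch of a console program that exercises the four variants. The call sites mirror the ones visible in the compiler output; the program name, the Test wrapper, and the final Writeln are assumptions added only to make it compile on its own.

```pascal
program FloatCallSketch; // hypothetical name, not from the original test unit
{$APPTYPE CONSOLE}

function GetFloat : Double;
begin
   Result := 0;
end;

function GetFloatInline : Double; inline;
begin
   Result := 0;
end;

procedure GetFloatVar(var Result : Double);
begin
   Result := 0;
end;

procedure GetFloatVarInline(var Result : Double); inline;
begin
   Result := 0;
end;

// Assumed wrapper: keeping f a local variable makes the generated
// code address it relative to the stack, as in the listings.
procedure Test;
var
   f : Double;
begin
   f := GetFloat;          // function call, result returned via the FPU
   f := GetFloatInline;    // inlined function
   GetFloatVar(f);         // procedure call, callee writes f directly
   GetFloatVarInline(f);   // inlined procedure
   Writeln(f:0:1);         // use f so the assignments are not flagged as dead
end;

begin
   Test;
end.
```

Setting a breakpoint inside Test and opening the CPU view should show the code generated for each call site; the exact addresses and stack offsets will differ from the ones shown here.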
Here is the Delphi XE6 compiler output for calls to those functions; the inlined variants are nice and tight:
Unit1.pas.48: f := GetFloat;
005D7357 E8DCFFFFFF       call GetFloat
005D735C DD1C24           fstp qword ptr [esp]
005D735F 9B               wait
Unit1.pas.49: f := GetFloatInline;
005D7360 33C0             xor eax,eax
005D7362 890424           mov [esp],eax
005D7365 89442404         mov [esp+$04],eax
Unit1.pas.50: GetFloatVar(f);
005D7369 8BC4             mov eax,esp
005D736B E8DCFFFFFF       call GetFloatVar
Unit1.pas.51: GetFloatVarInline(f);
005D7370 33C0             xor eax,eax
005D7372 890424           mov [esp],eax
005D7375 89442404         mov [esp+$04],eax
By comparison, here is Delphi XE compiler output for the GetFloatInline call. The output is unchanged for the other calls.
Unit1.pas.49: f := GetFloatInline;
004AB664 33C0             xor eax,eax
004AB666 89442408         mov [esp+$08],eax
004AB66A 8944240C         mov [esp+$0c],eax
004AB66E 8B442408         mov eax,[esp+$08]    // stack juggling
004AB672 890424           mov [esp],eax        // stack juggling
004AB675 8B44240C         mov eax,[esp+$0c]    // stack juggling
004AB679 89442404         mov [esp+$04],eax    // stack juggling
And that’s just for the call (you have other induced overhead in the pre-amble and post-amble), and just for a trivial function returning a constant.
So the Delphi XE6 compiler demonstrates a clear advantage here.
What about the non-inlined functions?
Well, nothing changed, and the procedure variant still has the edge: a function returning a float still exhibits the stack round-trip in XE6, just as it did in Delphi XE:
Unit1.pas.24: function GetFloat : Double;
Unit1.pas.25: begin
005D7338 83C4F8           add esp,-$08
Unit1.pas.26: Result := 0;
005D733B 33C0             xor eax,eax
005D733D 890424           mov [esp],eax
005D7340 89442404         mov [esp+$04],eax
Unit1.pas.27: end;
005D7344 DD0424           fld qword ptr [esp]
005D7347 59               pop ecx
005D7348 5A               pop edx
005D7349 C3               ret
Unit1.pas.34: procedure GetFloatVar(var Result : Double);
Unit1.pas.35: begin
Unit1.pas.36: Result := 0;
005D734C 33D2             xor edx,edx
005D734E 8910             mov [eax],edx
005D7350 895004           mov [eax+$04],edx
Unit1.pas.37: end;
005D7353 C3               ret
A look at a marginally more complex case
What happens when the function is slightly more complex?
function Add(const a, b : Double) : Double;
begin
   Result := a+b;
end;
Well, the non-inlined form still compiles rather inefficiently in both XE and XE6:
Unit1.pas.45: begin
005D7354 55               push ebp
005D7355 8BEC             mov ebp,esp
005D7357 83C4F0           add esp,-$10
Unit1.pas.46: Result := a+b;
005D735A DD4510           fld qword ptr [ebp+$10]
005D735D DC4508           fadd qword ptr [ebp+$08]
005D7360 DD5DF0           fstp qword ptr [ebp-$10]
005D7363 9B               wait
005D7364 DD45F0           fld qword ptr [ebp-$10]
005D7367 DD5DF8           fstp qword ptr [ebp-$08]
005D736A 9B               wait
Unit1.pas.47: end;
005D736B DD45F8           fld qword ptr [ebp-$08]
005D736E 8BE5             mov esp,ebp
005D7370 5D               pop ebp
005D7371 C21000           ret $0010
For reference, the optimal form would involve just three instructions:
fld a
fadd b
ret
What about inlining? Well, things changed, but not all for the better…
Here is the inefficient inlining in Delphi XE:
Unit1.pas.53: f := Add(a, b);
004AB65B DD442408         fld qword ptr [esp+$08]
004AB65F DC442410         fadd qword ptr [esp+$10]
004AB663 DD5C2418         fstp qword ptr [esp+$18]
004AB667 9B               wait
004AB668 8B442418         mov eax,[esp+$18]
004AB66C 890424           mov [esp],eax
004AB66F 8B44241C         mov eax,[esp+$1c]
004AB673 89442404         mov [esp+$04],eax
And here is the inefficient inlining in Delphi XE6:
Unit1.pas.53: f := Add(a, b);
005D7357 DD442408         fld qword ptr [esp+$08]
005D735B DC442410         fadd qword ptr [esp+$10]
005D735F DD5C2418         fstp qword ptr [esp+$18]
005D7363 9B               wait
005D7364 DD442418         fld qword ptr [esp+$18]
005D7368 DD1C24           fstp qword ptr [esp]
005D736B 9B               wait
So the stack juggling is still there, except that instead of being handled by integer instructions, it’s now handled by FPU instructions, along with a pointless wait instruction.
If your code is already bottlenecked by the FPU, this just won’t help…
Conclusion
The new function inlining in XE6 can provide some improvements over XE, but it can also result in less efficient code in a floating-point heavy context.
It also means that the need for using procedure-with-var-for-result instead of functions has – alas – not been eliminated by XE6, and there may be just as many cases in which performance goes up as cases in which performance will go down.
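Since the outcome depends on the surrounding code, the safest approach is to measure. Below is a rough timing sketch, not a rigorous benchmark: it compares an inlined function against an inlined procedure-with-var for the Add case. AddVar, RunBench, the iteration count, and the program name are all hypothetical names introduced for illustration, and TStopwatch is assumed to come from System.Diagnostics (as in XE2 and later; in XE the unit is named Diagnostics).

```pascal
program InlineFloatBench; // hypothetical name
{$APPTYPE CONSOLE}

uses
   System.SysUtils, System.Diagnostics;

function Add(const a, b : Double) : Double; inline;
begin
   Result := a+b;
end;

// Hypothetical procedure-with-var counterpart of Add
procedure AddVar(const a, b : Double; var Result : Double); inline;
begin
   Result := a+b;
end;

procedure RunBench;
const
   cIterations = 100000000; // arbitrary, adjust to your machine
var
   i : Integer;
   f, x : Double;
   sw : TStopwatch;
begin
   x := 1.5;

   // function form
   f := 0;
   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do
      f := Add(f, x);
   Writeln('function:  ', sw.ElapsedMilliseconds, ' ms (f=', f:0:1, ')');

   // procedure-with-var form, accumulating into the same variable
   f := 0;
   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do
      AddVar(f, x, f);
   Writeln('procedure: ', sw.ElapsedMilliseconds, ' ms (f=', f:0:1, ')');
end;

begin
   RunBench;
end.
```

Both loops should print the same final value of f, which serves as a sanity check that they do the same work. The timings themselves will vary with compiler version, optimization settings, and CPU, so they are best read alongside the CPU view rather than on their own.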