- DelphiTools - https://www.delphitools.info -

# A Look at Improved Inlining in Delphi XE6

First noticed by dewresearch [1], Delphi XE6 introduced a new optimization for inlined functions that return a floating-point value.

Here is an exploration of what was improved… and what was not improved.

When inlining was introduced in Delphi, one limitation was that functions returning a floating-point value would incur an unnecessary round-trip to the stack, which for short, simple math functions could not just negate the benefits of inlining but actually make performance worse.

With XE6, that round-trip seems to have been optimized away in some cases.

#### A look at the most trivial case

Here are test cases for the usual conventions for functions returning a floating-point value:

```
function GetFloat : Double;
begin
  Result := 0;
end;

function GetFloatInline : Double; inline;
begin
  Result := 0;
end;

procedure GetFloatVar(var Result : Double);
begin
  Result := 0;
end;

procedure GetFloatVarInline(var Result : Double); inline;
begin
  Result := 0;
end;
```

The procedure variants traditionally offered higher performance than the function variants, by eliminating all round-trips to the stack.

Here is the Delphi XE6 compiler output for calls to those functions. The inlined variants are nice and tight:
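For readers who want to reproduce the disassembly, here is a minimal sketch of a call site for the four variants. The program name `FloatCalls` and the procedure `RunCalls` are assumptions made for illustration; the original test unit is not shown in full, so this is only one plausible shape for it.

```pascal
program FloatCalls;
{$APPTYPE CONSOLE}

function GetFloat : Double;
begin
  Result := 0;
end;

function GetFloatInline : Double; inline;
begin
  Result := 0;
end;

procedure GetFloatVar(var Result : Double);
begin
  Result := 0;
end;

procedure GetFloatVarInline(var Result : Double); inline;
begin
  Result := 0;
end;

// Hypothetical call site: f is a local variable, so it lives on the stack.
procedure RunCalls;
var
  f : Double;
begin
  f := GetFloat;          // regular function call
  f := GetFloatInline;    // inlined function
  GetFloatVar(f);         // regular procedure call, f passed by reference
  GetFloatVarInline(f);   // inlined procedure
  Writeln(f:0:1);         // use f so the final store is not dead
end;

begin
  RunCalls;
end.
```

Setting a breakpoint inside `RunCalls` and opening the CPU view in the IDE should show the code generated for each of the four calls.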

```
Unit1.pas.48: f := GetFloat;
005D7357 E8DCFFFFFF       call GetFloat
005D735C DD1C24           fstp qword ptr [esp]
005D735F 9B               wait
Unit1.pas.49: f := GetFloatInline;
005D7360 33C0             xor eax,eax
005D7362 890424           mov [esp],eax
005D7365 89442404         mov [esp+$04],eax
Unit1.pas.50: GetFloatVar(f);
005D7369 8BC4             mov eax,esp
005D736B E8DCFFFFFF       call GetFloatVar
Unit1.pas.51: GetFloatVarInline(f);
005D7370 33C0             xor eax,eax
005D7372 890424           mov [esp],eax
005D7375 89442404         mov [esp+$04],eax
```

By comparison, here is Delphi XE compiler output for the GetFloatInline call. The output is unchanged for the other calls.

```
Unit1.pas.49: f := GetFloatInline;
004AB664 33C0             xor eax,eax
004AB666 89442408         mov [esp+$08],eax
004AB66A 8944240C         mov [esp+$0c],eax
004AB66E 8B442408         mov eax,[esp+$08]   // stack juggling
004AB672 890424           mov [esp],eax       // stack juggling
004AB675 8B44240C         mov eax,[esp+$0c]   // stack juggling
004AB679 89442404         mov [esp+$04],eax   // stack juggling
```

And that’s just for the call (you have other induced overhead in the pre-amble and post-amble), and just for a trivial function returning a constant.

So the Delphi XE6 compiler demonstrates a clear advantage for the inlined call.

For the non-inlined routines, however, nothing changed, and the procedure variant still has the edge. A function returning a float still exhibits the stack round-trip in XE6, in the same way as in Delphi XE:

```
Unit1.pas.24: function GetFloat : Double;
Unit1.pas.25: begin
Unit1.pas.26: Result := 0;
005D733B 33C0             xor eax,eax
005D733D 890424           mov [esp],eax
005D7340 89442404         mov [esp+$04],eax
Unit1.pas.27: end;
005D7344 DD0424           fld qword ptr [esp]
005D7347 59               pop ecx
005D7348 5A               pop edx
005D7349 C3               ret

Unit1.pas.34: procedure GetFloatVar(var Result : Double);
Unit1.pas.35: begin
Unit1.pas.36: Result := 0;
005D734C 33D2             xor edx,edx
005D734E 8910             mov [eax],edx
005D7350 895004           mov [eax+$04],edx
Unit1.pas.37: end;
005D7353 C3               ret
```

#### A look at a marginally more complex case

What happens when the function is slightly more complex?

```
function Add(const a, b : Double) : Double;
begin
  Result := a+b;
end;
```

Well, the non-inlined form still compiles rather inefficiently in both XE and XE6:

```
Unit1.pas.45: begin
005D7354 55               push ebp
005D7355 8BEC             mov ebp,esp
Unit1.pas.46: Result := a+b;
005D735A DD4510           fld qword ptr [ebp+$10]
005D735D DC4508           fadd qword ptr [ebp+$08]
005D7360 DD5DF0           fstp qword ptr [ebp-$10]
005D7363 9B               wait
005D7364 DD45F0           fld qword ptr [ebp-$10]
005D7367 DD5DF8           fstp qword ptr [ebp-$08]
005D736A 9B               wait
Unit1.pas.47: end;
005D736B DD45F8           fld qword ptr [ebp-$08]
005D736E 8BE5             mov esp,ebp
005D7370 5D               pop ebp
005D7371 C21000           ret $0010
```

For reference, the optimal form would involve just three instructions:

```
fld a
fadd b
ret
```
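For illustration, the three-instruction form can be approximated by hand with the built-in assembler. This is only a sketch under stated assumptions: it targets the 32-bit x86 compiler (x87 FPU, default `register` calling convention, where a `Double` result is returned in ST(0)), and `AddAsm` is a hypothetical name that does not come from the original test unit. Because the parameters are passed on the stack, the compiler is expected to add its own frame setup and the `ret $0010` around the asm body.

```pascal
program AddAsmDemo;
{$APPTYPE CONSOLE}

{$IFDEF CPU386}
// Hypothetical hand-written variant, 32-bit x87 only.
// The sum stays in ST(0), which is where a Double result is returned,
// so no store to a temporary is needed.
function AddAsm(const a, b : Double) : Double;
asm
  fld  a      // ST(0) := a
  fadd b      // ST(0) := a + b
end;
{$ELSE}
// Plain Pascal fallback for other targets.
function AddAsm(const a, b : Double) : Double;
begin
  Result := a + b;
end;
{$ENDIF}

begin
  Writeln(AddAsm(1.5, 2.25):0:2);   // prints 3.75
end.
```

Note that asm routines cannot be inlined, so this variant still pays the cost of a call.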

What about inlining? Well things changed, but not all for the best…

Here is the inefficient inlining in Delphi XE:

```
Unit1.pas.53: f := Add(a, b);
004AB65B DD442408         fld qword ptr [esp+$08]
004AB65F DC442410         fadd qword ptr [esp+$10]
004AB663 DD5C2418         fstp qword ptr [esp+$18]
004AB667 9B               wait
004AB668 8B442418         mov eax,[esp+$18]
004AB66C 890424           mov [esp],eax
004AB66F 8B44241C         mov eax,[esp+$1c]
004AB673 89442404         mov [esp+$04],eax
```

And here is the inefficient inlining in Delphi XE6:

```
Unit1.pas.53: f := Add(a, b);
005D7357 DD442408         fld qword ptr [esp+$08]
005D735B DC442410         fadd qword ptr [esp+$10]
005D735F DD5C2418         fstp qword ptr [esp+$18]
005D7363 9B               wait
005D7364 DD442418         fld qword ptr [esp+$18]
005D7368 DD1C24           fstp qword ptr [esp]
005D736B 9B               wait
```

So the stack juggling is still there, except that instead of being handled by integer instructions, it’s now handled by FPU instructions, along with a pointless wait instruction.

If your code is already bottlenecked by the FPU, this just won’t help…

#### Conclusion

The new function inlining in XE6 can provide some improvements over XE, but it can also result in less efficient code in a floating-point heavy context.

It also means that the need for using procedure-with-var-for-result instead of functions has – alas – not been eliminated by XE6, and there may be just as many cases in which performance goes up as cases in which performance will go down.
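Since the effect can go either way, it is worth measuring on your own code. Here is a minimal sketch that times an inlined function against the procedure-with-var-for-result form of the same addition. The names `AddFunc`, `AddVar`, `Run` and `cIterations` are assumptions made for this example, and `TStopwatch` comes from the `System.Diagnostics` unit, whose scoped name assumes XE2 or later. No particular outcome is implied; timings depend on the compiler version, the settings and the machine.

```pascal
program AddVarDemo;
{$APPTYPE CONSOLE}

uses
  System.Diagnostics;   // TStopwatch

// Function form: returns the sum as a floating-point result.
function AddFunc(const a, b : Double) : Double; inline;
begin
  Result := a + b;
end;

// Procedure form: writes the sum into the caller's variable.
procedure AddVar(const a, b : Double; var Result : Double); inline;
begin
  Result := a + b;
end;

const
  cIterations = 100000000;   // hypothetical loop count, adjust to taste

procedure Run;
var
  i : Integer;
  f, g : Double;
  sw : TStopwatch;
begin
  f := 0;
  sw := TStopwatch.StartNew;
  for i := 1 to cIterations do
    f := AddFunc(f, 1);
  Writeln('function  : ', sw.ElapsedMilliseconds, ' ms');

  g := 0;
  sw := TStopwatch.StartNew;
  for i := 1 to cIterations do
    AddVar(g, 1, g);
  Writeln('procedure : ', sw.ElapsedMilliseconds, ' ms');

  // Both loops perform the same additions, so the sums should match.
  if f <> g then
    Writeln('results differ: ', f, ' vs ', g);
end;

begin
  Run;
end.
```

Build it in Release configuration with optimization enabled, and check the CPU view to confirm that both calls were actually inlined before trusting the numbers.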