In a Google+ comment to my recent article about inlining in XE6 [1], Leif Uneus posted results from Scimark [2].
It appears that XE6 is about 30% slower than previous versions (at least XE through XE5) for floating-point code compiled for Win32 (32-bit).
Note that Scimark does not make use of inlining, but does make heavy use of floating-point computations, loops and arrays.
Edit: issue discussed here was reported in QC 124652 [3] (now marked as resolved)
Scimark2 results
Here are the Scimark2 results reported by Leif Uneus (I was able to replicate similar ratios on both an AMD Phenom and an Intel E5 in an XE6 vs XE comparison):
XE6 Win32 Results:
Mininum running time = 2,00 seconds
Composite Score MFlops: 632,06
FFT Mflops: 297,35 (N=1024)
SOR Mflops: 895,01 (100 x 100)
MonteCarlo: Mflops: 184,05
Sparse matmult Mflops: 360,58 (N=1000, nz=5000)
LU Mflops: 1423,33 (M=100, N=100)

XE5 Win32 Results:
Mininum running time = 2,00 seconds
Composite Score MFlops: 859,98
FFT Mflops: 390,91 (N=1024)
SOR Mflops: 1193,53 (100 x 100)
MonteCarlo: Mflops: 198,91
Sparse matmult Mflops: 538,50 (N=1000, nz=5000)
LU Mflops: 1978,03 (M=100, N=100)
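To put numbers on the slowdown, here is a minimal sketch that turns the MFlops figures above into percentages. The program name and array layout are mine, and the decimal commas from the output are written as decimal points. It prints two figures per test, because "X% slower" can mean either throughput lost or extra running time, and the two are not the same number.

```pascal
program ScimarkRatios;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  Names: array[0..5] of string =
    ('Composite', 'FFT', 'SOR', 'MonteCarlo', 'Sparse matmult', 'LU');
  // MFlops figures copied from the results quoted above
  XE6: array[0..5] of Double = (632.06, 297.35, 895.01, 184.05, 360.58, 1423.33);
  XE5: array[0..5] of Double = (859.98, 390.91, 1193.53, 198.91, 538.50, 1978.03);

var
  i: Integer;
begin
  for i := Low(Names) to High(Names) do
    Writeln(Format('%-15s throughput lost: %4.1f%%   extra time: %4.1f%%',
      [Names[i],
       100.0 * (1.0 - XE6[i] / XE5[i]),    // drop in MFlops relative to XE5
       100.0 * (XE5[i] / XE6[i] - 1.0)])); // extra time for the same work
end.
```

With these inputs, the composite score comes out at roughly 26% less throughput (about 36% more time), Monte Carlo at about 7.5% (about 8% more time), and sparse matrix multiplication at about 33% (about 49% more time).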
Interestingly enough, the slowdown affects every algorithm in the Scimark bench, some more than others: sparse matrix multiplication takes about 50% longer in XE6!
So I investigated the simplest one, which is Successive Over-Relaxation, aka “SOR”.
Investigating SOR
The SOR test features an inner loop where the bulk of execution time is conveniently found, and it’s a loop with a single line in it:
Gi[j] := omega_over_four * (Gim1[j] + Gip1[j] + Gi[j - 1] + Gi[j + 1]) + one_minus_omega * Gi[j];
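For context, here is a minimal sketch of how that line sits inside an SOR sweep. It is modeled on the general shape of the SciMark SOR kernel, not copied from the Delphi port: the type names, the `SorSweep` procedure, the 3 x 3 grid and the choice of omega = 1.25 are all assumptions made for illustration.

```pascal
program SorSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TRow = array of Double;
  TGrid = array of TRow;

// One SOR sweep over the interior cells of G; boundary cells are left untouched.
procedure SorSweep(const G: TGrid; omega: Double);
var
  i, j: Integer;
  omega_over_four, one_minus_omega: Double;
  Gi, Gim1, Gip1: TRow;
begin
  omega_over_four := omega * 0.25;
  one_minus_omega := 1.0 - omega;
  for i := 1 to High(G) - 1 do
  begin
    // Dynamic arrays are references, so these alias the rows of G (no copy).
    Gi := G[i];
    Gim1 := G[i - 1];
    Gip1 := G[i + 1];
    for j := 1 to High(Gi) - 1 do
      Gi[j] := omega_over_four * (Gim1[j] + Gip1[j] + Gi[j - 1] + Gi[j + 1])
             + one_minus_omega * Gi[j];
  end;
end;

var
  G: TGrid;
  i, j: Integer;
begin
  // Hypothetical test grid: all ones, with a zero in the single interior cell.
  SetLength(G, 3, 3);
  for i := 0 to 2 do
    for j := 0 to 2 do
      G[i][j] := 1.0;
  G[1][1] := 0.0;

  SorSweep(G, 1.25);
  Writeln(G[1][1]:0:4);  // 0.3125 * 4 + (-0.25) * 0 = 1.2500
end.
```

Each interior cell costs four loads, one store and a handful of floating-point operations, so any extra store and reload per iteration is a noticeable fraction of the loop body.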
Here is what the Delphi XE compiler generates:
0040FE44 DD04C1    fld qword ptr [ecx+eax*8]
0040FE47 8B75E4    mov esi,[ebp-$1c]
0040FE4A DC04C6    fadd qword ptr [esi+eax*8]
0040FE4D DC44C2F8  fadd qword ptr [edx+eax*8-$08]
0040FE51 DC44C208  fadd qword ptr [edx+eax*8+$08]
0040FE55 DC4DF0    fmul qword ptr [ebp-$10]
0040FE58 DD45E8    fld qword ptr [ebp-$18]
0040FE5B DC0CC2    fmul qword ptr [edx+eax*8]
0040FE5E DEC1      faddp st(1)
0040FE60 DD1CC2    fstp qword ptr [edx+eax*8]
0040FE63 9B        wait
And here is the output from Delphi XE6. The meaningful differences are at the end, marked with a comment:
0041A872 8B75E4    mov esi,[ebp-$1c]
0041A875 DD04C1    fld qword ptr [ecx+eax*8]
0041A878 DC04C6    fadd qword ptr [esi+eax*8]
0041A87B DC44C2F8  fadd qword ptr [edx+eax*8-$08]
0041A87F DC44C208  fadd qword ptr [edx+eax*8+$08]
0041A883 DC4DF0    fmul qword ptr [ebp-$10]
0041A886 DD45E8    fld qword ptr [ebp-$18]
0041A889 DC0CC2    fmul qword ptr [edx+eax*8]
0041A88C DEC1      faddp st(1)
0041A88E DD5DC8    fstp qword ptr [ebp-$38]
0041A891 9B        wait
0041A892 DD45C8    fld qword ptr [ebp-$38]     // Stack Juggling
0041A895 DD1CC2    fstp qword ptr [edx+eax*8]  // Stack Juggling
0041A898 9B        wait                        // Stack Juggling
The key difference is the unnecessary stack juggling at the end, which looks suspiciously similar to the one uncovered in the XE6 inlining article [1]… coincidence?
(“stack juggling” means that the compiler stores a value to the stack, then immediately loads it back)
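As a rough source-level analogy (not something the Scimark source does), the XE6 listing behaves as if the result went through a local temporary before reaching its destination. The two functions below are hypothetical and only illustrate the shape of the two store sequences; an optimizing compiler is free to generate identical code for both.

```pascal
program JugglingSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Direct store: the computed value is written to its destination once,
// like the single fstp at the end of the XE listing.
function StoreDirect(Cell: Double): Double;
begin
  Result := 0.3125 * (4.0 * Cell) - 0.25 * Cell;
end;

// "Juggled" store: the value is first written to a stack temporary, then read
// back and written again, like the fstp / fld / fstp tail of the XE6 listing.
function StoreJuggled(Cell: Double): Double;
var
  Tmp: Double;  // stands in for the [ebp-$38] slot
begin
  Tmp := 0.3125 * (4.0 * Cell) - 0.25 * Cell;
  Result := Tmp;
end;

begin
  // Same value either way; only the extra memory traffic differs.
  Writeln(StoreDirect(0.5) = StoreJuggled(0.5));  // prints TRUE
end.
```

The round trip never changes the result, which is why it is pure overhead: one extra store, one extra load and one extra wait per loop iteration.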
There is another minor difference at the beginning, which should be negated by any modern CPU’s out-of-order execution capability (though XE’s version is probably preferable).
In terms of loop alignment, both XE & XE6 are similarly bad, i.e. unaligned.
Conclusion
If you were wondering how much unnecessary stack juggling can cost, you now have an answer: about 30% on average.
If you’re lucky like the Monte Carlo integration, you’ll only lose 7%, but of course if you’re unlucky like the sparse matrix multiplication, you can lose 50%…
So if you’re doing a lot of floating-point work in 32-bit (Win32) builds, it’s probably best to avoid XE6 until this gets fixed.