Delphi XE6 32bits and Scimark

By Eric Grange / May 8, 2014

In a Google+ comment to my recent article about inlining in XE6, Leif Uneus posted results from Scimark.

It appears that XE6 is about 30% slower than previous versions (at least XE through XE5) for 32-bit floating point.
Note that Scimark does not make use of inlining, but does make heavy use of floating-point computations, loops and arrays.

Edit: issue discussed here was reported in QC 124652 (now marked as resolved)

Scimark2 results

Here are the Scimark2 results reported by Leif Uneus (I was able to replicate similar ratios on both an AMD Phenom and an Intel E5 in an XE6 vs XE comparison).

XE6 Win32 Results:

Mininum running time = 2,00 seconds
Composite Score MFlops:   632,06
FFT             Mflops:   297,35    (N=1024)
SOR             Mflops:   895,01    (100 x 100)
MonteCarlo:     Mflops:   184,05
Sparse matmult  Mflops:   360,58    (N=1000, nz=5000)
LU              Mflops:  1423,33    (M=100, N=100)

XE5 Win32 Results:

Mininum running time = 2,00 seconds
Composite Score MFlops:   859,98
FFT             Mflops:   390,91    (N=1024)
SOR             Mflops:  1193,53    (100 x 100)
MonteCarlo:     Mflops:   198,91
Sparse matmult  Mflops:   538,50    (N=1000, nz=5000)
LU              Mflops:  1978,03    (M=100, N=100)

Interestingly enough, the slowdown affects all the algorithms in the Scimark bench, some more than others: sparse matrix multiplication is hit hardest, with XE5 about 50% faster than XE6 on that test.
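Since a slowdown can be quoted either as throughput lost by XE6 or as extra speed of XE5, here is a minimal sketch that computes both from the scores listed above. The program and the helper names LossPercent and GainPercent are made up for illustration; only the MFlops values come from the results.

```pascal
program ScimarkRatios;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  TestCount = 6;
  // Scores copied from the XE6 and XE5 Win32 results above (MFlops).
  Names: array[0..TestCount - 1] of string =
    ('Composite', 'FFT', 'SOR', 'MonteCarlo', 'Sparse matmult', 'LU');
  XE6: array[0..TestCount - 1] of Double =
    (632.06, 297.35, 895.01, 184.05, 360.58, 1423.33);
  XE5: array[0..TestCount - 1] of Double =
    (859.98, 390.91, 1193.53, 198.91, 538.50, 1978.03);

// Share of throughput the slower build gives up, in percent.
function LossPercent(slow, fast: Double): Double;
begin
  Result := (1.0 - slow / fast) * 100.0;
end;

// How much faster the faster build is, in percent.
function GainPercent(slow, fast: Double): Double;
begin
  Result := (fast / slow - 1.0) * 100.0;
end;

var
  i: Integer;
begin
  for i := 0 to TestCount - 1 do
    Writeln(Names[i]:16,
            '  XE6 loss: ', LossPercent(XE6[i], XE5[i]):5:1, '%',
            '  XE5 gain: ', GainPercent(XE6[i], XE5[i]):5:1, '%');
end.
```

For the sparse matrix multiplication the two readings come out at roughly 33% lost by XE6 and roughly 49% gained by XE5, which is why it matters to say which one is being quoted.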

So I investigated the simplest one, which is Successive Over-Relaxation, aka “SOR”.

Investigating SOR

The SOR test features an inner loop where the bulk of execution time is conveniently found, and it’s a loop with a single line in it:

Gi[j] :=   omega_over_four * (Gim1[j] + Gip1[j] + Gi[j - 1] + Gi[j + 1])
         + one_minus_omega * Gi[j];

Here is what the Delphi XE compiler generates:

0040FE44 DD04C1           fld qword ptr [ecx+eax*8]
0040FE47 8B75E4           mov esi,[ebp-$1c]
0040FE4A DC04C6           fadd qword ptr [esi+eax*8]
0040FE4D DC44C2F8         fadd qword ptr [edx+eax*8-$08]
0040FE51 DC44C208         fadd qword ptr [edx+eax*8+$08]
0040FE55 DC4DF0           fmul qword ptr [ebp-$10]
0040FE58 DD45E8           fld qword ptr [ebp-$18]
0040FE5B DC0CC2           fmul qword ptr [edx+eax*8]
0040FE5E DEC1             faddp st(1)
0040FE60 DD1CC2           fstp qword ptr [edx+eax*8]
0040FE63 9B               wait

And here is the output from Delphi XE6. The meaningful differences are the reordered mov at the top and the instructions commented as stack juggling:

0041A872 8B75E4           mov esi,[ebp-$1c]                
0041A875 DD04C1           fld qword ptr [ecx+eax*8]
0041A878 DC04C6           fadd qword ptr [esi+eax*8]
0041A87B DC44C2F8         fadd qword ptr [edx+eax*8-$08]
0041A87F DC44C208         fadd qword ptr [edx+eax*8+$08]
0041A883 DC4DF0           fmul qword ptr [ebp-$10]
0041A886 DD45E8           fld qword ptr [ebp-$18]
0041A889 DC0CC2           fmul qword ptr [edx+eax*8]
0041A88C DEC1             faddp st(1)
0041A88E DD5DC8           fstp qword ptr [ebp-$38]
0041A891 9B               wait 
0041A892 DD45C8           fld qword ptr [ebp-$38]       // Stack Juggling
0041A895 DD1CC2           fstp qword ptr [edx+eax*8]    // Stack Juggling
0041A898 9B               wait                          // Stack Juggling

The key difference is the unnecessary stack juggling at the end, which looks suspiciously similar to the one uncovered in the XE6 inlining article… coincidence?

(“stack juggling” means that the compiler stores a value to the stack, then immediately loads it back)

There is another minor difference at the beginning, which should be negated by any modern CPU’s out-of-order execution capability (though XE’s version is probably preferable).

In terms of loop alignment, both XE & XE6 are similarly bad, i.e. unaligned.
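For anyone who wants to inspect the generated code on their own compiler version, the loop can be isolated in a small program. What follows is a sketch and not the Scimark port itself: the surrounding procedure, the type names, the RunOnConstantGrid helper and the omega value of 1.25 are assumptions modelled on the reference SciMark kernel; only the inner statement is taken from the article.

```pascal
program SorSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TRow = array of Double;
  TGrid = array of TRow;

// Hypothetical wrapper around the inner statement quoted in the article.
procedure SorExecute(M, N: Integer; omega: Double; var G: TGrid;
  num_iterations: Integer);
var
  omega_over_four, one_minus_omega: Double;
  p, i, j: Integer;
  Gi, Gim1, Gip1: TRow;
begin
  omega_over_four := omega * 0.25;
  one_minus_omega := 1.0 - omega;
  for p := 1 to num_iterations do
    for i := 1 to M - 2 do
    begin
      // Dynamic arrays are references, so writing through Gi updates G[i].
      Gi := G[i];
      Gim1 := G[i - 1];
      Gip1 := G[i + 1];
      for j := 1 to N - 2 do
        Gi[j] := omega_over_four * (Gim1[j] + Gip1[j] + Gi[j - 1] + Gi[j + 1])
               + one_minus_omega * Gi[j];
    end;
end;

// Sanity check: a constant grid is a fixed point of the update,
// because omega/4 * 4c + (1 - omega) * c = c.
function RunOnConstantGrid(M, N: Integer; omega: Double;
  iterations: Integer): Double;
var
  G: TGrid;
  i, j: Integer;
begin
  SetLength(G, M, N);
  for i := 0 to M - 1 do
    for j := 0 to N - 1 do
      G[i][j] := 1.0;
  SorExecute(M, N, omega, G, iterations);
  Result := G[M div 2][N div 2];
end;

begin
  // A grid filled with 1.0 should still hold 1.0 after any number of sweeps.
  Writeln(RunOnConstantGrid(100, 100, 1.25, 10):0:6);
end.
```

Putting a breakpoint on the inner assignment and opening the debugger's CPU view should show whether the result is stored directly into the array or bounced through a stack temporary first.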

Conclusion

If you were wondering how much unnecessary stack juggling can cost, you now have an answer: about 30% on average.

If you’re lucky like the Monte Carlo integration, you’ll only lose about 7%, but of course if you’re unlucky like the sparse matrix multiplication, you lose about a third of your throughput (XE5 is roughly 50% faster)…

So if you’re doing a lot of 32-bit floating-point, it’s probably best to avoid XE6 until this gets fixed.

32 thoughts on “Delphi XE6 32bits and Scimark”

David says:

May 8, 2014 at 12:22

The spread between Delphi and FPC continues to grow.

http://pastebin.com/N51M6gTy
Peter says:

May 8, 2014 at 12:29

QC number?
Dalija Prasnikar says:

May 8, 2014 at 14:10

Can you please report these findings in QC.
sam says:

May 8, 2014 at 18:00

There must be something wrong.

I got also “negative” Values when testing on my machine.
>>> Sparse matmult Mflops: -485.50 (N=1000, nz=5000) // ???
Herbert Sauro says:

May 8, 2014 at 19:36

Those numbers are really bad, 50% drop on the matrix multiplication! So the advice might be, do all your calculations using fast pascal and write the GUI code with XE6?
sam says:

May 9, 2014 at 06:38

Hello Herbert,

I'll try to explain once more:
There is something wrong with the code in Scimark.
The test “Sparse matmult” returns in some cases a negative Value (-485.50 Mflops)
e.g. :>>> Sparse matmult Mflops: -485.50 (N=1000, nz=5000) // ???

This is impossible.
I have seen some results posted by other people with Delphi XE5/6 ( with negative Values of Sparse matmult)
I can reproduce the negative Values on Linux with FPC 2.6.4.

Conclusion:
1.) don’t trust the numbers of the current Scimark implementation.
2.) it’s a problem with Scimark Pascal/Delphi Sourcecode. ( because it happens in Delphi and FPC)

Regards,

Sam
Andreas says:

May 9, 2014 at 08:26

Maybe someone forgot to initialize a variable for one of the tests in scimark, but that assembler still doesn’t look good – you don’t need a benchmark to see that it’s bad. Honestly I don’t understand why people still care about new releases of Delphi – it’s entertaining, but to me it looks like a complete waste of time.
More interesting to see would be an analysis of the differences between Delphi 7 and FPC. In my experience applications compiled with Delphi 7 are still about 5-10% faster.
Eric Grange says:

May 9, 2014 at 09:05

@sam never saw any negative values here on the various machines I ran it on. Looking at the code, it’s using GetTickCount and looks correct… but are you testing in a VM with a time drift problem? (this is a common issue with vmware, KVM, etc. google for “time drift vm”)
sam says:

May 9, 2014 at 09:16

Hello Eric,

look at thread
http://forum.lazarus.freepascal.org/index.php/topic,24509.msg147584.html?PHPSESSID=10cf0d3560f7a37adda1e58da7a24b98#msg147584

User Fiji published his results: See results for Delphi XE6 64.
I ran the test with FPC 2.6.4 on win64 and got negative results as well.
Eric Grange says:

May 9, 2014 at 09:41

Hmm, well the “cycles” value is an integer and multiplied by two at each run, and thus could overflow into negative range if more than 31 runs are necessary. However, the way it scales iterations, it would be surprising to see that complexity reached (highest I’ve seen it go was 18).

So you could try to set a break-point after the loop and see the value you get.

Were you running in a VM? If so time drift would still be my prime suspect by far (and DelphiFreak might have used one as well for all we know), time drifting is a very common issue in vmware & kvm.
sam says:

May 9, 2014 at 09:45

Thanks for your investigation.

And to answer your question:
I was not running in a VM.
David says:

May 9, 2014 at 09:52

@sam this is because FPC supports SSE2, 3, 4.1, 4.2, AVX, AVX2
Its just that much faster.
Unspoken says:

May 9, 2014 at 11:05

And guys responsible for FPC compiler do not care that much about the speed. There is a lot of room for improvement.
From the other side, look at Delphi – there is no support even for MMX instructions which have been introduced like…18 years ago?!
Unspoken says:

May 9, 2014 at 11:07

Seeing performance degradation with each new version of the Delphi compiler, it is safe to say that Embarcadero does not have any performance tests.
Leif says:

May 9, 2014 at 11:54

I took the liberty to add your example into this QC report:
Reported as QC124652 “x32 compiler regression for floating point expressions” http://qc.embarcadero.com/wc/qcmain.aspx?d=124652
Eric Grange says:

May 9, 2014 at 12:00

@Leif Thanks. Note that btw it occurs in a variety (most?) of expressions as well.
David says:

May 9, 2014 at 13:42

They can never be competitive with their own compiler against proven solutions such as ICC, GCC etc. so the only wise thing would be to just use LLVM. I mean after all C++ Builder uses it as well.

I thought that they were actually doing something to improve the compiler.

@Eric you should do an article about poor performance of Record Helpers in XE6. The underlying assembly will surprise you. 😉
Herbert says:

May 9, 2014 at 17:24

Thanks for the clarification Sam. I do a bit of 32 bit floating point computations so the article was of interest to me.
LDS says:

May 9, 2014 at 18:38

They will be never competitive with their own compiler as long as they hire developers just because they are cheap and not because they are (very) good. Where is the compiler developed now? Iasi? Alicante? Who are the compiler guys now? What credentials they have? Sure, going along the line of LLVM would be a “safe road” – you use a compiler developed elsewhere and with a good record, on the other side you are bound to a compiler developed elsewhere by someone with his requirements and targets, not yours.
sam says:

May 10, 2014 at 15:33

Re: Scimark
« Reply #11 on: Today at 09:41:29 am »
I have changed (in unit SparseCompRow)
function SparseCompRow_num_flops(N, nz, num_iterations: integer): integer;
to
function SparseCompRow_num_flops(N, nz, num_iterations: integer):cardinal;
and the problem with negative MFlops is gone.
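
To see why the return type matters, here is a minimal sketch of the overflow. The function bodies are a hypothetical reconstruction based on the reference SciMark formula (flops = actual_nz * 2 * num_iterations), not the port's actual source, and the iteration count of 262144 (2 to the power 18) is an assumed value for a run that had to double its cycle count 18 times.

```pascal
program NumFlopsOverflow;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$Q-}{$R-} // no overflow or range checks: values wrap silently

// Hypothetical body with the original signed 32-bit return type.
function NumFlopsInteger(N, nz, num_iterations: Integer): Integer;
var
  actual_nz: Integer;
begin
  actual_nz := (nz div N) * N;
  Result := actual_nz * 2 * num_iterations;
end;

// Same hypothetical body with the return type changed to Cardinal.
function NumFlopsCardinal(N, nz, num_iterations: Integer): Cardinal;
var
  actual_nz: Integer;
begin
  actual_nz := (nz div N) * N;
  Result := Cardinal(actual_nz) * 2 * Cardinal(num_iterations);
end;

begin
  // 5000 * 2 * 262144 = 2621440000, which exceeds the Integer
  // maximum of 2147483647 but still fits in a Cardinal.
  Writeln('Integer : ', NumFlopsInteger(1000, 5000, 262144));
  Writeln('Cardinal: ', NumFlopsCardinal(1000, 5000, 262144));
end.
```

Under these assumptions the Integer version wraps to a negative flop count, which would explain a negative MFlops figure, while the Cardinal version stays positive. Cardinal only buys one more doubling, though: at 524288 iterations the product exceeds 32 bits again, so returning Int64 or Double would be the more robust choice.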

Lazarus 1.2.2 directory FPC 2.6.4 i386-linux-gtk 2

original code
*************
Composite Score MFlops: 490.03
FFT Mflops: 250.03 (N=1024)
SOR Mflops: 836.35 (100 x 100)
MonteCarlo: Mflops: 170.81
Sparse matmult Mflops: 610.20 (N=1000, nz=5000)
LU Mflops: 582.73 (M=100, N=100)

my little optimization in procedure SparseCompRow_matmult ( remove the un-needed local vars rowR, rowRp1: integer; )
Composite Score MFlops: 520.78
FFT Mflops: 262.21 (N=1024)
SOR Mflops: 847.01 (100 x 100)
MonteCarlo: Mflops: 171.58
Sparse matmult Mflops: 738.64 (N=1000, nz=5000)
LU Mflops: 584.47 (M=100, N=100)
Michel says:

May 10, 2014 at 16:14

The main problem is that they have nearly the same codegen as in Delphi 2 (the first 32 bit compiler). They did nothing in the last 19 years in the codegen area. Many high-performance asm instructions, introduced 10 years ago by Intel and AMD, have never been used in the Delphi compiler. The optimization in dcc is a horror. Sad!
David says:

May 10, 2014 at 23:16

Why do people still buy Delphi if its getting slower and slower each new version. I guess $3000 or more is not enough for a shitty compiler + accessories.
FMXExpress says:

May 11, 2014 at 04:01

Delphi XE5 64bit Inline Combo
Composite Score MFlops: 1015.83
FFT Mflops: 1014.68 (N=1024)
SOR Mflops: 1126.54 (100 x 100) //inline
MonteCarlo: Mflops: 212.03
Sparse matmult Mflops: 645.36 (N=1000, nz=5000) //inline // local var removal
LU Mflops: 2080.51 (M=100, N=100)

Delphi XE6 64bit Inline Combo
Composite Score MFlops: 1001.70
FFT Mflops: 996.71 (N=1024)
SOR Mflops: 1120.97 (100 x 100) //inline
MonteCarlo: Mflops: 216.13
Sparse matmult Mflops: 630.76 (N=1000, nz=5000) //inline // local var removal
LU Mflops: 2043.91 (M=100, N=100)

FPC 2.6.4 64bit Inline Combo (I don’t know if I did the 64bit right here as the numbers are almost the same as 32bit)
Composite Score MFlops: 613.24
FFT Mflops: 416.54 (N=1024)
SOR Mflops: 1111.04 (100 x 100) // inline
MonteCarlo: Mflops: 122.71
Sparse matmult Mflops: 559.24 (N=1000, nz=5000) //inline // local var removal
LU Mflops: 856.68 (M=100, N=100)
Eric Grange says:

May 11, 2014 at 05:43

Before you start tweaking the code, remember that the point of the benchmark is to compare compiler performance from similar starting code. Once you start throwing in tweaks and optimizations, you no longer compare the ability of the compiler to handle trivial optimizations.

And if minor tweaks have noticeable effects (like removing local vars), that only underlines the compiler shortcomings…
sam says:

May 11, 2014 at 09:24

Hi Eric,
I (more or less) agree with you.

But in general, I wanted to make people watch out to not take just some “numbers” from the web and then do quick conclusion. ( I have another current thread about TIOBE where I tried to make the audience think a little 😉

Here my personal final conclusion:
1.) someone translated Scimark from Programming Language XYZ to Delphi.
2.) another person translated Scimark from Delphi to Lazarus.
3.) I can now compile&run the Scimark on my computer on windows with the Delphi Versions I own.
4.) I can now compile&run the Scimark on my computer on linux&windows with different FPC versions.
5.) I can compare these results and play with the different compiler settings available in FPC and Delphi.
6.) I can make changes to this sources and see how I can optimize this stuff.
7.) if a benchmark shows me “negative” MFlops then I get skeptical!
David says:

May 11, 2014 at 14:37

@sam I used the exact same code in Delphi and FPC. No changes whatsoever. So pound for pound FPC destroys Delphi. AFAIK Delphi 32 bit compiler doesn’t even support SSE.

@FMXExpress Use trunk. 2.6.4 is just too slow.
sam says:

May 16, 2014 at 08:12

Some more data:

And as always! Only trust your own investigations.
My personal conclusion about this values:

1. We need to investigate more about FFT.

Borland 2002
************
Scimark_Delphi7_win32.exe
** **
** SciMark2a Numeric Benchmark, see http://math.nist.gov/scimark **
** **
** Delphi Port, see http://code.google.com/p/scimark-delphi/ **
** **
Mininum running time = 2.00 seconds
Composite Score MFlops: 562.40
FFT Mflops: 279.53 (N=1024)
SOR Mflops: 863.72 (100 x 100)
MonteCarlo: Mflops: 126.50
Sparse matmult Mflops: 398.27 (N=1000, nz=5000)
LU Mflops: 1143.97 (M=100, N=100)

FreePascal 2014
***************
Scimark_Lazarus_win32_2_7_1.exe
** **
** SciMark2a Numeric Benchmark, see http://math.nist.gov/scimark **
** **
** Lazarus Port **
** **
Mininum running time = 2.00 seconds
Composite Score MFlops: 612.69
FFT Mflops: 270.09 (N=1024)
SOR Mflops: 888.82 (100 x 100)
MonteCarlo: Mflops: 137.66
Sparse matmult Mflops: 567.66 (N=1000, nz=5000)
LU Mflops: 1199.24 (M=100, N=100)

VisualStudio 2013
*****************
SciMark_C#_win32.exe
** **
** SciMark2a Numeric Benchmark, see http://math.nist.gov/scimark **
** **
Mininum running time = 2 seconds

Composite Score: 622.46 MFlops
FFT : 567.56 – (1024)
SOR : 895.09 – (100×100)
Monte Carlo : 77.86
Sparse MatMult : 571.57 – (N=1000, nz=5000)
LU : 1000.24 – (100×100)
stone says:

May 16, 2014 at 09:04

QC 124652 had been closed.
Eric Grange says:

May 16, 2014 at 09:10

> QC 124652 had been closed.

Indeed… Also just after the closure Tomohiro Takahashi commented with “I will check why this report was closed…”
Leif says:

May 19, 2014 at 10:06

QC 124652 now has been marked as resolved.
Let’s see if the correction will be released in an update or in XE7.
David says:

May 20, 2014 at 23:04

@Leif you mean XE6 SP1 🙂
stone says:

May 21, 2014 at 12:23

FMX.Types.pas (XE6)

TTextTrimming = (None, Character, Word); (line 123)
TPixelFormat = (None, RGB, RGBA, BGR…); (line 181)
TAlignLayout = (None, Top, Left, Right, …); (line 240)
TDragOperation = (None, Move, Copy, Link); (line 281)
TAdjustType = (None, FixedSize, FixedWidth, FixedHeight); (line 584)

Delphi has no hope.
