Delphi XE6 32bits and Scimark

In a Google+ comment on my recent article about inlining in XE6, Leif Uneus posted results from Scimark.

It appears that XE6 is about 30% slower than previous versions (at least from XE to XE5) for 32-bit floating point.
Note that Scimark does not make use of inlining, but does make heavy use of floating-point computations, loops and arrays.

Edit: the issue discussed here was reported as QC 124652 (now marked as resolved).

Scimark2 results

Here are the Scimark2 results reported by Leif Uneus (I was able to replicate similar ratios on both an AMD Phenom and an Intel E5 in an XE6 vs XE comparison).

XE6 Win32 Results:

Minimum running time = 2,00 seconds
Composite Score MFlops:   632,06
FFT             Mflops:   297,35    (N=1024)
SOR             Mflops:   895,01    (100 x 100)
MonteCarlo:     Mflops:   184,05
Sparse matmult  Mflops:   360,58    (N=1000, nz=5000)
LU              Mflops:  1423,33    (M=100, N=100)

XE5 Win32 Results:

Minimum running time = 2,00 seconds
Composite Score MFlops:   859,98
FFT             Mflops:   390,91    (N=1024)
SOR             Mflops:  1193,53    (100 x 100)
MonteCarlo:     Mflops:   198,91
Sparse matmult  Mflops:   538,50    (N=1000, nz=5000)
LU              Mflops:  1978,03    (M=100, N=100)

Interestingly enough, the slowdown affects all the algorithms in the Scimark bench, some more than others (sparse matrix multiplication takes a 50% hit!).

So I investigated the simplest one, which is Successive Over-Relaxation, aka “SOR”.

Investigating SOR

The SOR test features an inner loop where the bulk of the execution time is conveniently found, and it’s a loop with a single statement in it:

Gi[j] :=   omega_over_four * (Gim1[j] + Gip1[j] + Gi[j - 1] + Gi[j + 1])
         + one_minus_omega * Gi[j];

Here is what the Delphi XE compiler generates:

0040FE44 DD04C1           fld qword ptr [ecx+eax*8]
0040FE47 8B75E4           mov esi,[ebp-$1c]
0040FE4A DC04C6           fadd qword ptr [esi+eax*8]
0040FE4D DC44C2F8         fadd qword ptr [edx+eax*8-$08]
0040FE51 DC44C208         fadd qword ptr [edx+eax*8+$08]
0040FE55 DC4DF0           fmul qword ptr [ebp-$10]
0040FE58 DD45E8           fld qword ptr [ebp-$18]
0040FE5B DC0CC2           fmul qword ptr [edx+eax*8]
0040FE5E DEC1             faddp st(1)
0040FE60 DD1CC2           fstp qword ptr [edx+eax*8]
0040FE63 9B               wait

And here is the output from Delphi XE6; the meaningful differences are annotated below:

0041A872 8B75E4           mov esi,[ebp-$1c]                
0041A875 DD04C1           fld qword ptr [ecx+eax*8]
0041A878 DC04C6           fadd qword ptr [esi+eax*8]
0041A87B DC44C2F8         fadd qword ptr [edx+eax*8-$08]
0041A87F DC44C208         fadd qword ptr [edx+eax*8+$08]
0041A883 DC4DF0           fmul qword ptr [ebp-$10]
0041A886 DD45E8           fld qword ptr [ebp-$18]
0041A889 DC0CC2           fmul qword ptr [edx+eax*8]
0041A88C DEC1             faddp st(1)
0041A88E DD5DC8           fstp qword ptr [ebp-$38]
0041A891 9B               wait 
0041A892 DD45C8           fld qword ptr [ebp-$38]       // Stack Juggling
0041A895 DD1CC2           fstp qword ptr [edx+eax*8]    // Stack Juggling
0041A898 9B               wait                          // Stack Juggling

The key difference is the unnecessary stack juggling at the end, which looks suspiciously similar to the one uncovered in the XE6 inlining article… coincidence?

(“stack juggling” means that the compiler stores a value to the stack, then immediately loads it back)

There is another minor difference at the beginning, which should be negated by any modern CPU’s out-of-order execution capability (though XE’s version is probably preferable).

In terms of loop alignment, both XE and XE6 are similarly bad, i.e. unaligned.


If you were wondering how much unnecessary stack juggling can cost, you now have an answer: 30% on average.

If you’re lucky, like the Monte Carlo integration, you’ll only lose 7%; but if you’re unlucky, like the sparse matrix multiplication, you can lose 50%…

So if you’re doing a lot of 32-bit floating-point work, it’s probably best to avoid XE6 until this gets fixed.

32 thoughts on “Delphi XE6 32bits and Scimark”

  1. There must be something wrong.

    I also got “negative” values when testing on my machine.
    >>> Sparse matmult Mflops: -485.50 (N=1000, nz=5000) // ???

  2. Those numbers are really bad – a 50% drop on the matrix multiplication! So the advice might be: do all your calculations using fast Pascal and write the GUI code with XE6?

  3. Hello Herbert,

    Let me try to explain once more:
    There is something wrong with the code in Scimark.
    The test “Sparse matmult” returns a negative value in some cases (-485.50 Mflops),
    e.g.: >>> Sparse matmult Mflops: -485.50 (N=1000, nz=5000) // ???

    This is impossible.
    I have seen results posted by other people with Delphi XE5/6 (with negative Sparse matmult values).
    I can reproduce the negative values on Linux with FPC 2.6.4.

    1.) Don’t trust the numbers of the current Scimark implementation.
    2.) It’s a problem with the Scimark Pascal/Delphi source code (because it happens in both Delphi and FPC).



  4. Maybe someone forgot to initialize a variable for one of the tests in Scimark, but that assembler still doesn’t look good – you don’t need a benchmark to see that it’s bad. Honestly, I don’t understand why people still care about new releases of Delphi – it’s entertaining, but to me it looks like a complete waste of time.
    More interesting would be an analysis of the differences between Delphi 7 and FPC. In my experience, applications compiled with Delphi 7 are still about 5-10% faster.

  5. @sam I never saw any negative values here on the various machines I ran it on. Looking at the code, it’s using GetTickCount and looks correct… but are you testing in a VM with a time drift problem? (This is a common issue with VMware, KVM, etc.; google for “time drift vm”.)

  6. Hmm, well, the “cycles” value is an integer and is multiplied by two at each run, so it could overflow into the negative range if more than 31 runs are necessary. However, given the way it scales iterations, it would be surprising to see that complexity reached (the highest I’ve seen it go was 18).

    So you could try to set a breakpoint after the loop and see the value you get.

    Were you running in a VM? If so, time drift would still be my prime suspect by far (and DelphiFreak might have used one as well, for all we know); time drift is a very common issue in VMware & KVM.

  7. Thanks for your investigation.

    And to answer your question:
    I was not running in a VM.

  8. @sam This is because FPC supports SSE2, 3, 4.1, 4.2, AVX, AVX2.
    It’s just that much faster.

  9. And the guys responsible for the FPC compiler do not care that much about speed; there is a lot of room for improvement.
    On the other side, look at Delphi – there is no support even for MMX instructions, which were introduced… 18 years ago?!

  10. Seeing the performance degradation with each new version of the Delphi compiler, it is safe to say that Embarcadero does not have any performance tests.

  11. They can never be competitive with their own compiler against proven solutions such as ICC, GCC, etc., so the only wise thing would be to just use LLVM. After all, C++Builder uses it as well.

    I thought that they were actually doing something to improve the compiler.

    @Eric you should do an article about the poor performance of Record Helpers in XE6. The underlying assembly will surprise you. 😉

  12. Thanks for the clarification Sam. I do a bit of 32 bit floating point computations so the article was of interest to me.

  13. They will never be competitive with their own compiler as long as they hire developers just because they are cheap and not because they are (very) good. Where is the compiler developed now? Iasi? Alicante? Who are the compiler guys now? What credentials do they have? Sure, going along the LLVM line would be a “safe road” – you use a compiler developed elsewhere and with a good track record; on the other hand, you are bound to a compiler developed elsewhere by someone with their own requirements and targets, not yours.

  14. I have changed (in unit SparseCompRow)

    function SparseCompRow_num_flops(N, nz, num_iterations: integer): integer;

    to

    function SparseCompRow_num_flops(N, nz, num_iterations: integer): cardinal;

    and the problem with negative MFlops is gone.

    Lazarus 1.2.2, FPC 2.6.4, i386-linux-gtk2

    Original code:
    Composite Score MFlops: 490.03
    FFT Mflops: 250.03 (N=1024)
    SOR Mflops: 836.35 (100 x 100)
    MonteCarlo: Mflops: 170.81
    Sparse matmult Mflops: 610.20 (N=1000, nz=5000)
    LU Mflops: 582.73 (M=100, N=100)

    My little optimization in procedure SparseCompRow_matmult (removing the unneeded local vars rowR, rowRp1: integer):
    Composite Score MFlops: 520.78
    FFT Mflops: 262.21 (N=1024)
    SOR Mflops: 847.01 (100 x 100)
    MonteCarlo: Mflops: 171.58
    Sparse matmult Mflops: 738.64 (N=1000, nz=5000)
    LU Mflops: 584.47 (M=100, N=100)

  15. The main problem is that they have nearly the same codegen as Delphi 2 (the first 32-bit compiler). They have done nothing in the codegen area in the last 19 years. Many high-performance asm instructions, introduced 10 years ago by Intel and AMD, have never been used by the Delphi compiler. The optimization in dcc is a horror. Sad!

  16. Why do people still buy Delphi if it’s getting slower and slower with each new version? I guess $3000 or more is not enough for a shitty compiler + accessories.

  17. Delphi XE5 64bit Inline Combo
    Composite Score MFlops: 1015.83
    FFT Mflops: 1014.68 (N=1024)
    SOR Mflops: 1126.54 (100 x 100) //inline
    MonteCarlo: Mflops: 212.03
    Sparse matmult Mflops: 645.36 (N=1000, nz=5000) //inline // local var removal
    LU Mflops: 2080.51 (M=100, N=100)

    Delphi XE6 64bit Inline Combo
    Composite Score MFlops: 1001.70
    FFT Mflops: 996.71 (N=1024)
    SOR Mflops: 1120.97 (100 x 100) //inline
    MonteCarlo: Mflops: 216.13
    Sparse matmult Mflops: 630.76 (N=1000, nz=5000) //inline // local var removal
    LU Mflops: 2043.91 (M=100, N=100)

    FPC 2.6.4 64bit Inline Combo (I don’t know if I did the 64-bit build right here, as the numbers are almost the same as 32-bit)
    Composite Score MFlops: 613.24
    FFT Mflops: 416.54 (N=1024)
    SOR Mflops: 1111.04 (100 x 100) // inline
    MonteCarlo: Mflops: 122.71
    Sparse matmult Mflops: 559.24 (N=1000, nz=5000) //inline // local var removal
    LU Mflops: 856.68 (M=100, N=100)

  18. Before you start tweaking the code, remember that the point of the benchmark is to compare compiler performance from similar starting code; once you start throwing in tweaks and optimizations, you no longer compare the ability of the compiler to handle trivial optimizations.

    And if minor tweaks (like removing local vars) have noticeable effects, that only underlines the compiler’s shortcomings…

  19. Hi Eric,
    I (more or less) agree with you.

    But in general, I wanted to warn people not to just take some “numbers” from the web and jump to quick conclusions. (I have another current thread about TIOBE where I tried to make the audience think a little. 😉)

    Here is my personal final conclusion:
    1.) Someone translated Scimark from programming language XYZ to Delphi.
    2.) Another person translated Scimark from Delphi to Lazarus.
    3.) I can now compile & run Scimark on my computer on Windows with the Delphi versions I own.
    4.) I can now compile & run Scimark on my computer on Linux & Windows with different FPC versions.
    5.) I can compare these results and play with the different compiler settings available in FPC and Delphi.
    6.) I can make changes to these sources and see how I can optimize this stuff.
    7.) If a benchmark shows me “negative” MFlops, then I get skeptical!

  20. @sam I used the exact same code in Delphi and FPC, no changes whatsoever. So pound for pound, FPC destroys Delphi. AFAIK the Delphi 32-bit compiler doesn’t even support SSE.

    @FMXExpress Use trunk. 2.6.4 is just too slow.

  21. Some more data:

    And as always: only trust your own investigations.
    My personal conclusion about these values:

    1. We need to investigate FFT more.

    Borland 2002
    ** SciMark2a Numeric Benchmark, see **
    ** Delphi Port, see **
    Minimum running time = 2.00 seconds
    Composite Score MFlops: 562.40
    FFT Mflops: 279.53 (N=1024)
    SOR Mflops: 863.72 (100 x 100)
    MonteCarlo: Mflops: 126.50
    Sparse matmult Mflops: 398.27 (N=1000, nz=5000)
    LU Mflops: 1143.97 (M=100, N=100)

    FreePascal 2014
    ** SciMark2a Numeric Benchmark, see **
    ** Lazarus Port **
    Minimum running time = 2.00 seconds
    Composite Score MFlops: 612.69
    FFT Mflops: 270.09 (N=1024)
    SOR Mflops: 888.82 (100 x 100)
    MonteCarlo: Mflops: 137.66
    Sparse matmult Mflops: 567.66 (N=1000, nz=5000)
    LU Mflops: 1199.24 (M=100, N=100)

    VisualStudio 2013
    ** SciMark2a Numeric Benchmark, see **
    Minimum running time = 2 seconds

    Composite Score: 622.46 MFlops
    FFT : 567.56 – (1024)
    SOR : 895.09 – (100×100)
    Monte Carlo : 77.86
    Sparse MatMult : 571.57 – (N=1000, nz=5000)
    LU : 1000.24 – (100×100)

  22. > QC 124652 had been closed.

    Indeed… Also, just after the closure, Tomohiro Takahashi commented with “I will check why this report was closed…”

  23. QC 124652 now has been marked as resolved.
    Let’s see if the correction will be released in an update or in XE7.

  24. FMX.Types.pas (XE6)

    TTextTrimming = (None, Character, Word); (line 123)
    TPixelFormat = (None, RGB, RGBA, BGR…); (line 181)
    TAlignLayout = (None, Top, Left, Right, …); (line 240)
    TDragOperation = (None, Move, Copy, Link);  (line 281)
    TAdjustType = (None, FixedSize, FixedWidth, FixedHeight);  (line 584)

    Delphi has no hope.
