XE2 single-precision floating point (partial) disappointment…

In the previous episode, it appeared that the Delphi XE2 64bit compiler was achieving quite good results. However, after further investigation, things may not be so clear-cut. Transcendental maths will be food for another post; the subject of this one is an issue with single-precision floating point maths.

edit: it turns out there is an undocumented {$EXCESSPRECISION OFF} directive which controls the generation of the conversion opcodes that hamper single-precision floating point performance; the article has been updated. Thanks to Allen Bauer, Andreano Lanusse & Leif Uneus for bringing it to attention!

Single precision

Single precision has a smaller memory footprint and is typically processed faster (especially when using SIMD: f.i. SSE allows processing 4 single-precision floats at the same time, while you can process only 2 double-precision floats at a time with SSE2), so it is often encountered in performance-critical code where precision isn’t essential. One typical use is 3D computations, meshes, and geometry.

Most 3D engines out there make heavy use of single-precision floating point (GLScene and thus FireMonkey too), and it’s the primary native float data type expected by most graphics hardware.

Updated benchmark charts

However, the new 64bit compiler doesn’t like single precision floats, while the 32bit compiler likes them. This leads to this interesting chart:

Mandelbrot times (ms), lower is better

                     XE2 – 32 bits    XE2 – 64 bits
Single Precision     115              257 / 66*
Double Precision     193              67

There are two figures in the 64bit single precision case. The high figure is what you see if you just compile with optimizations (yes, you’re seeing this right, single precision floating point maths in Delphi 64bit behaves worse than double-precision maths in Delphi 32bit!), and the low figure, marked with an asterisk, is what you get if you use the undocumented (up until this article) {$EXCESSPRECISION OFF} directive.

The new XE2 64bit compiler can give you the best and the worst: compared to double precision, single precision floats can make your 64bit code almost 4 times slower if you don’t turn off “excess precision”, while they make 32bit code almost 70% faster…

Why, oh why?

The reason? The 64bit compiler doesn’t use scalar single precision opcodes if you don’t have “excess precision” turned off, and instead converts everything back and forth to double precision. Here is a snippet from the CPU view:

FMandelTest.pas.193: x := x0 * x0 - y0 * y0 + p;
00000000005A1468 F3480F5AC4       cvtss2sd xmm0,xmm4
00000000005A146D F3480F5ACC       cvtss2sd xmm1,xmm4
00000000005A1472 F20F59C1         mulsd xmm0,xmm1
00000000005A1476 F3480F5ACD       cvtss2sd xmm1,xmm5
00000000005A147B F34C0F5AC5       cvtss2sd xmm8,xmm5
00000000005A1480 F2410F59C8       mulsd xmm1,xmm8
00000000005A1485 F20F5CC1         subsd xmm0,xmm1
00000000005A1489 F3480F5ACA       cvtss2sd xmm1,xmm2
00000000005A148E F20F58C1         addsd xmm0,xmm1
00000000005A1492 F2480F5AC0       cvtsd2ss xmm0,xmm0

This is the same code as for double precision, with loads of cvtss2sd & cvtsd2ss instructions thrown in! No mulss, subss or addss in sight, and yes, you can see redundant stuff happening: 4 lines are doing the actual computation, 6 are doing conversions, and doing them every… single… time.

If you’re a fan of “Lucky Luke“, the first two lines may remind you of a Dalton brothers prison break (even though the brothers are in the same cell, they each dig their own hole to freedom) 😉

Now if you have the {$EXCESSPRECISION OFF} directive, you see a different picture: the compiler uses single-precision opcodes as expected.

FMandelTest.pas.194: x := x0 * x0 - y0 * y0 + p;
00000000005A1450 0F28C4           movaps xmm0,xmm4
00000000005A1453 F30F59C4         mulss xmm0,xmm4
00000000005A1457 0F28CD           movaps xmm1,xmm5
00000000005A145A F30F59CD         mulss xmm1,xmm5
00000000005A145E F30F5CC1         subss xmm0,xmm1
00000000005A1462 F30F58C2         addss xmm0,xmm2
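For illustration, here is a minimal sketch of where the directive could go in a source file. MandelIterations is a hypothetical helper written for this example, not the benchmark’s actual code, and guarding the directive with the CPUX64 conditional is an assumption on my part (it only matters to the 64bit compiler):

```pascal
program MandelSingle;

{$APPTYPE CONSOLE}

{$IFDEF CPUX64}
   // Assumption: only the 64bit compiler needs this; it asks for
   // Single expressions to stay in single precision.
   {$EXCESSPRECISION OFF}
{$ENDIF}

// Hypothetical helper: counts iterations of z := z*z + c for the
// point (p, q), using Single for every intermediate value.
function MandelIterations(p, q : Single; maxIter : Integer) : Integer;
var
   x, y, x0, y0 : Single;
begin
   x0 := 0;
   y0 := 0;
   Result := 0;
   while Result < maxIter do begin
      x := x0 * x0 - y0 * y0 + p;   // same expression as in the CPU view
      y := 2 * x0 * y0 + q;
      if x * x + y * y > 4 then Break;
      x0 := x;
      y0 := y;
      Inc(Result);
   end;
end;

begin
   Writeln(MandelIterations(-0.5, 0.5, 1000));
end.
```

Compiling this for 64bit with and without the directive, then comparing the CPU view for the marked line, should show the difference between the two listings.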

As Ville Krumlinde pointed out in the comments, the VS 2010 C compiler has the same weird behavior.

I say weird, because if you go to the length of specifying single precision floating point, it’s usually because you mean it, and it’s trivial enough to have an expression be automatically promoted to double-precision by throwing in a Double operand or cast.
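As a rough sketch of what “throwing in a Double operand” could look like: the function names below are made up for the example, and how each version is actually compiled depends on the target and on the excess precision setting, so treat it as an illustration rather than a reference.

```pascal
program PromoteToDouble;

{$APPTYPE CONSOLE}

// Hypothetical example: the expression is written with Single operands only.
function SquareDiffSingle(a, b : Single) : Single;
begin
   Result := a * a - b * b;
end;

// Hypothetical example: the operands are widened once into Double locals,
// so the expression itself only involves Double operands.
function SquareDiffDouble(a, b : Single) : Double;
var
   da, db : Double;
begin
   da := a;
   db := b;
   Result := da * da - db * db;
end;

begin
   Writeln(SquareDiffSingle(1.1, 0.1));
   Writeln(SquareDiffDouble(1.1, 0.1));
end.
```

The point is that the programmer can ask for double precision explicitly where it matters, which is why having it on by default for Single expressions is surprising.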

This reminds me of the old $STRINGCHECKS directive, which one had to remember to adjust or suffer lower string performance. Hopefully the hand holding will be reversed in the next version, with excess precision being off by default.

21 thoughts on “XE2 single-precision floating point (partial) disappointment…”

  1. This means if you need heavy computation, use C++, the Delphi team has proven that they can’t make a decent optimizing compiler even after all these years, they’re still stuck on old Pentiums, you should check the integer benchmark (MD5, …) the compiler is really slow compared to C++ (VCC, GCC) and I mean SLOW!

  2. Eric,

    this may be out of scope, but I would like to see a comparison Delphi vs. C#/.Net – any chance to get this?

  3. That it converts x0 twice in an expression like “x0*x0” looks particularly clumsy.

    Another example:

    s1 : single;

    s1 := s1*1;

    This generates conversions from single to double and then back to single again (without any multiplication because the compiler has figured out that multiply with 1.0 is unnecessary 🙂 ).

  4. Michael Thuma :

    Does this mean – simply use double? (Requires more memory then).

    Yes, and if you pass doubles to hardware via OpenGL or DirectX, you’ll likely face a downconversion to single and a hidden buffer (i.e. up to triple memory use…). Also, it rules out direct buffering techniques, so using doubles isn’t really a solution…

  5. @Eric
    Thank you very much. This is not so good because one has to be very careful when thinking about the scenario applied to and cannot simply rely on a speedup.

    @DEV…especially when someone is skilled enough to combine .net and managed C/C++ for example. ‘For the Files’: Sequence is still faster in a hand optimized C code in sequence. Currently experimenting on the client side (don’t do this on the server side) with a threading model on Windows 7 that should work now in VS2011 C/C++ that allows a ‘batch execute’ a burst controlled by a worker – on one hand controlled by the developer and on the other hand by the OS in order to hinder the worker from consuming > 80% CPU time (avoid context switches between competing ‘processes’). (can be found in the last MSDN – nothing magic but had problems in the past when cleaning up after the calculation was done). I fear such things … who knows what tomorrow will bring.

    I did not expect this from Delphi … but I am a little sad to hear what Eric found out.

  6. for heavy math, we use MtxVec, from dewresearch, they use the assembly code from Intel, with SSE2, 3, 4, whatever, with a new OpenCL library that uses the GPU on FireMonkey; it’s not very easy to implement but, if you need speed, it’s the best choice I could find (I’m talking about engineering calculations that take more than 1 hour on a new i5 quad-core).

  7. I posted this issue to the Embarcadero forums. Dalija Prasnikar asked me to create a QC entry.
    Eric: Shall I create an entry about this issue?

  8. Try the test again with {$EXCESSPRECISION OFF}. That will tell the compiler to use single-precision instructions, which means that intermediate results could lose precision, but that’s what you’re expecting in this case. Also, make sure you set $O+ and that will improve it further.

  9. I just saw a comment from a compiler guy at EMB.
    Setting a compiler switch $EXCESSPRECISION ON/OFF around a code segment controls how single-precision floating point values are treated.
    Tested with your Mandelbrot and there was a significant speedup.

  10. Thanks for the benchmarks, could you try using the {$EXCESSPRECISION OFF} directive, the compiler will generate addss, mulss, subss, etc… and not do all the cvtss2sd and cvtsd2ss conversions.

  11. That’s great, the {$EXCESSPRECISION OFF} directive seems to do the trick! It should be documented. For my “s1 := s1*1;” example it now correctly optimizes away that line completely. Without the directive it inserts totally unnecessary conversion instructions for that line as described above.

    This reminds me of the excessive code generation because of Delphi/C++ Builder string compatibility some years back, which could also be turned off with a directive and thankfully was removed a few versions later.

    Is there a way to set this directive for the whole project without having to add it in each unit?

  12. I take it all back, what I said except for the integer optimization, I’d like to see how the compiler does when it comes to MD5 hash and so on, maybe there is another secret directive that optimizes it further, good job Embarcadero but why isn’t this documented anyway?

  13. Here is the code that nothing will help.. freepascal x64 – 47ms, delphi x32 – 37ms, delphi x64 – 2.8s!!!, ms vs c++ x64 – 93ms

    y, z: double;
    i, x: integer;
    t: Cardinal;
    write('Input X: ');
    t := GetTickCount;
    for i := 1 to 1000000 do
    y := i + (x / 2);
    z := sin(sqr(y * 3));
    t := (GetTickCount - t);
