- DelphiTools - https://www.delphitools.info -

XE2 single-precision floating point (partial) disappointment…

In the previous episode [1], it appeared that the Delphi XE2 64bit compiler was achieving quite good results. However, after further investigation, things may not be so clear-cut. Transcendental maths will be food for another post; the subject of this one is an issue with single-precision floating point maths.

edit: it turns out there is an undocumented {$EXCESSPRECISION OFF} directive which controls the generation of the conversion opcodes that hamper single-precision floating point performance; the article has been updated. Thanks Allen Bauer, Andreano Lanusse & Leif Uneus for bringing it to attention!

Single precision

Single precision has a smaller memory footprint and is typically processed faster, especially when using SIMD [2]: f.i. SSE allows processing 4 single-precision floats at a time, while SSE2 processes only 2 double-precision floats at a time. It is thus often encountered in performance-critical code where precision isn’t essential. One typical use is for 3D computations, meshes, and geometry.

Most 3D engines out there make heavy use of single-precision floating point (GLScene and thus [3] FireMonkey too), and it’s the primary native float data type expected by most graphics hardware.

Updated benchmark charts

However, by default the new 64bit compiler doesn’t like single precision floats, while the 32bit compiler likes them, which leads to this interesting chart:

Mandelbrot times (ms), lower is better

                      XE2 – 32 bits    XE2 – 64 bits
Single Precision      115              257 / 66*
Double Precision      193              67

There are two figures in the 64bit single precision case: the high figure is what you get if you just compile with optimizations (yes, you’re seeing this right, single precision floating point maths in Delphi 64bit is slower than double-precision maths in Delphi 32bit!), and the low figure (marked with *) is what you get with the undocumented (up until this article) {$EXCESSPRECISION OFF} directive.

The new XE2 64bit compiler can give you the best and the worst: using single precision floats can make your 64bit code almost 4 times slower than double precision (257 ms vs 67 ms) if you don’t turn off “excess precision”, while it can make 32bit code about 70% faster (115 ms vs 193 ms)…

Why, oh why?

The reason? Unless “excess precision” is turned off, the 64bit compiler doesn’t use scalar single precision opcodes and converts everything back and forth to double precision. Here is a snippet from the CPU view:

FMandelTest.pas.193: x := x0 * x0 - y0 * y0 + p;
00000000005A1468 F3480F5AC4       cvtss2sd xmm0,xmm4
00000000005A146D F3480F5ACC       cvtss2sd xmm1,xmm4
00000000005A1472 F20F59C1         mulsd xmm0,xmm1
00000000005A1476 F3480F5ACD       cvtss2sd xmm1,xmm5
00000000005A147B F34C0F5AC5       cvtss2sd xmm8,xmm5
00000000005A1480 F2410F59C8       mulsd xmm1,xmm8
00000000005A1485 F20F5CC1         subsd xmm0,xmm1
00000000005A1489 F3480F5ACA       cvtss2sd xmm1,xmm2
00000000005A148E F20F58C1         addsd xmm0,xmm1
00000000005A1492 F2480F5AC0       cvtsd2ss xmm0,xmm0

This is the same code as for double precision, with loads of cvtss2sd & cvtsd2ss instructions thrown in! No mulss, subss or addss in sight, and yes, you can see redundant stuff happening: 4 lines are doing the actual computation, 6 are doing conversions, and doing them every… single… time.

If you’re a fan of “Lucky Luke” [4], the first two lines may remind you of a Dalton brothers prison break (even though the brothers are in the same cell, they each dig their own hole to freedom) 😉

Now if you have the {$EXCESSPRECISION OFF} directive, you see a different picture: the compiler uses single-precision opcodes as expected.

FMandelTest.pas.194: x := x0 * x0 - y0 * y0 + p;
00000000005A1450 0F28C4           movaps xmm0,xmm4
00000000005A1453 F30F59C4         mulss xmm0,xmm4
00000000005A1457 0F28CD           movaps xmm1,xmm5
00000000005A145A F30F59CD         mulss xmm1,xmm5
00000000005A145E F30F5CC1         subss xmm0,xmm1
00000000005A1462 F30F58C2         addss xmm0,xmm2
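
For reference, here is a minimal sketch of how the directive could be placed in a unit. The unit name MandelKernel and the function MandelIterations are hypothetical (this is not the actual FMandelTest.pas benchmark code), and wrapping the directive in a CPUX64 check is just a precaution, assuming the directive only matters for the 64bit compiler:

```pascal
unit MandelKernel; // hypothetical unit name, for illustration only

{$IFDEF CPUX64}
  // Ask the 64bit compiler to keep Single expressions in single precision
  {$EXCESSPRECISION OFF}
{$ENDIF}

interface

// Returns the number of iterations before the point (p, q) escapes,
// capped at maxIter (hypothetical helper, not the benchmark's code)
function MandelIterations(p, q : Single; maxIter : Integer) : Integer;

implementation

function MandelIterations(p, q : Single; maxIter : Integer) : Integer;
var
   x, y, x0, y0 : Single;
begin
   x0 := 0;
   y0 := 0;
   Result := 0;
   while (Result < maxIter) and (x0 * x0 + y0 * y0 <= 4) do begin
      // same expression shape as in the CPU view snippets
      x := x0 * x0 - y0 * y0 + p;
      y := 2 * x0 * y0 + q;
      x0 := x;
      y0 := y;
      Inc(Result);
   end;
end;

end.
```

To check what you actually get, put a breakpoint in the loop and look at the CPU view: with the directive active you should see mulss/subss/addss rather than the cvtss2sd/cvtsd2ss pairs.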

As Ville Krumlinde pointed out in the comments, the VS 2010 C compiler has the same weird behavior [5].

I say weird, because if you go to the length of specifying single precision floating point, it’s usually because you mean it, and it’s trivial enough to have an expression be automatically promoted to double-precision by throwing in a Double operand or cast.
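
To illustrate that last point, here is a small sketch (program and variable names are made up for the example) of asking for double precision explicitly by using Double operands, while plain Single expressions are left alone:

```pascal
program PromoteToDouble; // hypothetical name, for illustration only

{$APPTYPE CONSOLE}

{$IFDEF CPUX64}
  {$EXCESSPRECISION OFF}
{$ENDIF}

var
   x0, y0, p : Single;
   x0d, y0d, pd : Double;
   xSingle : Single;
   xDouble : Double;
begin
   x0 := 0.1;
   y0 := 0.2;
   p := 0.3;

   // Every operand is a Single: with excess precision off, the 64bit
   // compiler is expected to stay in single precision here
   xSingle := x0 * x0 - y0 * y0 + p;

   // Copying the operands into Double variables explicitly requests
   // double-precision arithmetic for this expression
   x0d := x0;
   y0d := y0;
   pd := p;
   xDouble := x0d * x0d - y0d * y0d + pd;

   Writeln(xSingle, ' ', xDouble);
end.
```

The two results will usually differ in the low digits, which is the whole point: the extra precision is there when you ask for it, and only then.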

This reminds me of the old $STRINGCHECKS directive, which one had to remember to adjust or suffer lower string performance. Hopefully the hand holding will be reversed in the next version, with excess precision being off by default.