Archive

Posts Tagged ‘Optimization’

SynEdit performance bragging rights?

December 29th, 2011

SynEdit LogoCommitted another round of speedups in the SynEdit SVN, and AFAICT SynEdit is now amongst the fastest text and highlighting editors out there!

To benchmark it yourself, if you don’t have a large text file hanging around, you can make a “meaningful” one easily: just go to Delphi’s source\rtl\win directory and enter the following command-line:

type *.pas > c:\winall.pas

That should give you f.i. with  Delphi XE an 11 MB file, with 280k lines. Using the following file, here are my rough findings, for an executable compiled with Delphi XE, by order of decreasing performance:

Editor Time to load Time to reach last line Total
SynEdit << 1 sec 0 sec << 1 sec
Delphi XE < 1 sec << 1 sec < 1 sec
EmEditor (11.0.2) 0 sec 1* sec 1* sec
Scintilla SciTe 0 sec 2 sec 2 sec
Eclipse (Indigo) 3 sec 0 sec 3 sec
Notepad (Win7) 3 sec 0 sec 3 sec
Notepad++ (5.9.6.2) 0 sec 16 sec 16 sec

*: EmEditor  does its line indexing work in a background task, so the 1 sec delay to reach the last line is only visible if you try to reach it just after having opened the file.

Once you’ve reached the last line at least once, entering text at the top or the bottom of the file, selecting blocks, copy-pasting etc. is instantaneous enough in all above editors. Memory usage is roughly comparable, except for Eclipse, which is the worst by far.

Notepad++ is using Scintilla, but performs much worse than SciTe (Scintilla minimalistic text editor/demo), so there must be a glitch somewhere in Notepad++ (which I personally didn’t expect, must have been a regression of some sort in recent versions).

SynEdit also seems to scale better than  all the others above on even larger files (as tested on 100MB+ files).

For bragging rights, what other fast editors do you know of? ;-)

 

 

News ,

Fixing TCriticalSection

November 30th, 2011

TCriticalSection (along with TMonitor*) suffers from a severe design flaw in which entering/leaving different TCriticalSection instances can end up serializing your threads, and the whole can even end up performing worse than if your threads had been serialized.

This is because it’s a small, dynamically allocated object, so several TCriticalSection instances can end up in the same CPU cache line, and when that happens, you’ll have cache conflicts aplenty between the cores running the threads.

How severe can that be? Well, it depends on how many cores you have, but the more cores you have, the more severe it can get. On a quad core, a bad case of contention can easily result in a 200% slowdown on top of the serialization. And it won’t always be reproducible, since it’s related to dynamic memory allocation.

There is thankfully a simple fix for that, use TFixedCriticalSection:

type
   TFixedCriticalSection = class(TCriticalSection)
      private
         FDummy : array [0..95] of Byte;
   end;

That’s it folks. This makes sure the instance size larger than 96 bytes, which means that it’ll be larger than the cache line in all current CPUs, so no serialization anymore across distinct critical section instances.

As a bonus, it also ends up using one of the larger, more aligned, FastMM bucket, which seems to improve critical section code performance by about 7%. The downside is you use more RAM… but how many critical sections do you really have?

* (11-12-01): as noted by Allen Bauer in the comments, the issue is fixed for TMonitor in XE2.

Tips ,

Delphi XE2 VCL Styles and 3D

October 24th, 2011

…or when the old/new VCL mule shows it can still kick!

I was asked how hard it would be to do yet-another-Cover Flow-clone with VCL+GLScene, and how that would stand vs using FireMonkey on Windows.

GLFlow : VCL + OpenGL

The new Delphi XE2 Styles allow to get a nice looking UI, allowing to mix 3D graphics more smoothly without grayish controls getting in the way:

GLFlow - Powered by Delphi XE2 & GLScene

You can get the source and a pre-compiled executable from googlecode for a more interactive experience or a peek at the source code.

Note that in the video below, image load times have not been edited away: it’s actually near instantaneous (unsurprisingly so, the FireFlow sample pics are small ones)

The pictures above are those from FireMonkey’s “FireFlow”‘ sample, which you’ll already have if you have Delphi XE2 and have updated your samples directory.
You can select any folder with JPG pictures (and compare vs FireFlow), for instance on pictures from a digital camera (like these).

Now for some highlights:

  • High frame-rate
    • even if you shake the slider or click on pictures to the far right/left
    • no need for a gaming 3D card, even Netbook GPUs are enough
  • Fast loading of images in a background thread
    • almost instantaneous for the FireFlow samples
    • you can still interact while the loading takes place
    • works around the old TJPEGImage locking bug (QC43018, QC55871…)
  • Can handle large JPEG images
    • takes advantage of built-in TJPEGImage facilities for quicker loading  & downsizing.
  • Can animate with dozens of pictures without slowdown
    • benefits from built-in frustum culling (doesn’t draw what isn’t visible)
  • No pixel shimmering thanks to anisotropic filtering
    • GPU-accelerated mipmap generation

The amount of code used in GLFlow is actually lower than in the FireFlow sample, despite the extra features. Beyond the loading phase, everything is handled by the GPU, and this performs well even on low-end integrated 3D chipsets (gaming hardware not required, even mere Atom Netbooks can handle it just fine). FireFlow on the other hand seems to require a fast GPU and a fast CPU (it can maximizes one core when animating, maybe an issue in the animation classes?).

Interestingly enough, apart from VCL styles, there is nothing truly “new”: it could have been done almost the same back in Y2K with GLScene and Delphi 5 (if less pretty).

For the cross-platform fans, compiling it for the Mac (with LCL instead of VCL)  is left as an exercise ;-)

I’ve placed the source on google code, as CoverFlow is well, trendy and pretty, and can come in handy. There are also quite a few low-hanging improvements I may improve upon later on (having multiple background loaders, further texture optimizations, cross-platform through LCL, adding text below the pics, etc.).

News, Tips , ,

Memory Manager Investigations

October 13th, 2011

André Mussche on Google+ investigated the performance of several Memory Managers for Delphi, in single-threaded & multi-threaded situations, with detailed results and charts on performance and memory usage. Great work and interesting findings!

His conclusions (which I share)

For single threaded or low memory profile applications, the default Delphi memory manager (FastMM) is the fastest you can get. If you don’t realloc a lot (strings?), TCmalloc [from Google perftools] is fast too.

For multi threaded apps, it’s not easy to decided what to use. ScaleMM2 is the fastest but not stable. TCmalloc is a good one, but uses a lot of memory. MSVCRT [Microsoft allocator in msvcrt.dll] looks scalable in simple multi-threaded tests, but in extended test like FastCodeMMChallenge it is disappointing: slower and uses a lot of memory!
JeMalloc (used by the latest FireFox) is disappointing in multi-threaded areas, but uses the same low memory as FastMM: maybe FF can be made faster by using FastMM? :-)

Additionally, Hoard was tested, though it performed “off the charts” (in a literal and bad way).

You can check André’s charts for yourself:

All in all, for single-threaded applications, or when you have few threads or limited thread-based memory management, FastMM is still king of the Hill, and not just of the Delphi Hill, both in terms of performance, memory usage and robustness.
Pierre le Riche can be proud of his baby ;-)

As for multi-threaded applications, ScaleMM, once stabilized, could well become the next undisputed King of the Hill, and not just of the Delphi Hill again.

I don’t know if Embarcadero are aware of the technical lead this offers to Delphi, this is something worth some marketing buzz and MM authors support surely?

 

 

 

News, Tips , ,

Delphi XE2-64bit: bottleneck in trigonometric functions?

September 22nd, 2011

Taylor Series and Angle Reduction

In Delphi XE2 64bit, SSE2 is used to compute the trigonometric functions (cos, sin, etc.), and they are computed through what looks like Taylor series (with double-precision literals being coded in hexadecimal, likely to minimize compiler precision issues).

However Taylor series only work for small values, so when you have a large angle value, it has to be reduced in the 0 .. 2PI range, which typically involves a form of floating-point Euclidian division or exponent reduction. For typical SSE2 implementations, this means that computing a trigonometric function for a high angle value is slower, as this reduction has to be performed, typically, you’re looking at something like a 25% slowdown tops.

Bottleneck

That said, here comes iga2iga2 (in the comments) and Ville Krumlinde, which both noticed to a performance issue in Delphi XE2 64bit, especially when facing other compilers. In XE2 64 bit, the reduction is performed through a loop and a fixed-step reduction, which means that the greater the angle value, the slower it gets.

Here are some timings on a sin/cos benchmark (hundreds of thousandths of calls):

Angle value XE2-32 XE2-64
1.0 112 ms 86 ms
100 113 ms 125 ms
1e7 114 ms 3700 ms
1e14 128 ms 7600 ms

Choices, Choices, Choices

But… timing isn’t everything, when computing trigonometry for very large angles, you quickly run into numerical precision issues, and then, you basically have three options:

  • just give up, that’s actually what the FPU does in 32bits, f.i. look at the value of Sin(1e22) in Delphi 32bit, it’s… 1e22. Which is obviously not a valid sine value! And you’ve been living with that potential issue for all your 32bit life…
  • spit out something, anything, under the assumption that if the user went for such an angle, it was garbage, so garbage in, garbage out, no one will notice it, you didn’t see me do it… you can’t prove anything anyway!
  • try to be accurate, damn the timings, damn garbage in, damn the torpedoes, full precision ahead! That’s what XE2-64 is doing. I haven’t checked in details, but XE2 approach seem to be based on this approach: “argument reduction, for huge arguments: good to the last bit“, and it gets Sin(1e22) right.

Just try for Sin(1e22) in your favorite environment, the correct value is -0.8522, Delphi XE2 64bit Gets It Right, where other environments may just flash a bunch of random decimals to fool your eyes.

Update: as pointed by Daniel Bartlett in the comments, the AMD LibM library provides a much faster and similarly accurate implementation of sin/cos and other functions.

So, what gives?

If you’re after raw accuracy, you’ll have to pay for the extra execution cycles to avoid the garbage out. However, chances are, your code doesn’t have anywhere near the numerical accuracy to avoid garbage in, so no matter the precision in the reduction, you’ll still just get garbage out. And if your code was running in 32bit, chances are you had some huge garbage out already, due to the FPU giving up.

If you’re not after accuracy, f.i. if you’re just using sine/cosine for time-based animations, the extra computing precision may bite you, for no benefit, so you’re better off performing the reduction yourself, before calling sin/cos, using whatever low-precision implementation you wish.

In the long run, it might be preferable for Delphi to just adopt the GIGO approach, and keep the high precision implementations for a high precision maths library: in most situations, they won’t avoid GO because of GI, so it might be best to blend with the rest (in benchmarks).

Tips , ,

Happy {$EXCESSPRECISION OFF}!

September 9th, 2011

Just a notice: I’ve updated the XE2 single-precision floating point article after using the (up to now) undocumented {$EXCESSPRECISION OFF} directive, thanks to Allen Bauer for chiming in!

Executive summary: this directives enables use of single-precision SSE floating point instruction by the compiler, and brings their performance in line with expectations, making Delphi XE2 64bit compiler the new King of the Delphi Hill.

Edit: now documented here: Floating point precision control (Delphi for x64). That was fast!

Edit 2: an issue in the compiler was found related to non-explicitly typed constants by Ville Krumlinde, see QC #98753, so be aware of potential incorrect code generation with the initial version of the 64 bits compiler.

Edit 3: if you happen to be one of the error insight users, it will complain that the directive doesn’t exist.

Edit 4: the compiler issue was fixed for Delphi XE2 Update1!

Let’s hope that this directive will be fully supported, and in time face the same fate as $STRINGCHECKS did (ie. become another scary story for the long winter nights).

News , ,

XE2 single-precision floating point (partial) disappointment…

September 5th, 2011

In the previous episode, it appeared that Delphi XE2 64bit compiler was achieving quite good results, however, after further investigations, things may not be so clear-cut. Transcendental maths, which will be food for a another post, the subject of this one seems to be an issue with single-precision floating point maths.

edit: it appeared there is an undocumented {$EXCESSPRECISION OFF} directive which controls the generation of the conversion opcodes hampering single-precisions floating point performance, the articles has been updated. Thanks Allen Bauer, Andreano Lanusse & Leif Uneus for bringing it to attention!

Single precision

Single precision having a smaller memory footprint and being typically processed faster (especially when using SIMD, f.i. SSE allows processing 4 single-precision floats at the same time, while you can process only 2 double-precision floats at a time with SSE2), thus it is often encountered in performance-critical code where precision isn’t essential. One typical such use is for 3D computations, meshes, and geometry.

Most 3D engines out there make heavy use of single-precision floating point (GLScene and thus FireMonkey too), and it’s the primary native float data type expected by most graphics hardware.

Updated benchmark charts

However, the new 64bit compiler doesn’t like single precision floats, while the 32bit compiler likes them, this leads to this interesting chart:

Mandelbrot times (ms), lower is better

XE2 – 32 bits XE2 – 64 bits
Single Precision… 115 257 / 66*
Double Precision… 193 67

There are two figures in the 64bit single precision case, the high figure is what you see if you just compile with optimizations (yes, you’re seeing this right, single precision floating point math in Delphi 64bit behaves worse than double-precision maths in Delphi 32bits!), and the low figure is if you use the undocumented (up until this article)  {$EXCESSPRECISION OFF} directive.

The new XE 64bit compiler can give you the best, and the worst: using single precision floats can make your 64 bits code almost 4 times slower if you don’t turn off “excess precision”, while it can make 32 bits code 70% faster…

Why, oh why?

The reason? The 64bit compiler doesn’t use scalar single precision opcodes if you don’t have “excess precision”, turned off and converts everything back and forth to double precision. Here is a snippet from the CPU view:

FMandelTest.pas.193: x := x0 * x0 - y0 * y0 + p;
00000000005A1468 F3480F5AC4       cvtss2sd xmm0,xmm4
00000000005A146D F3480F5ACC       cvtss2sd xmm1,xmm4
00000000005A1472 F20F59C1         mulsd xmm0,xmm1
00000000005A1476 F3480F5ACD       cvtss2sd xmm1,xmm5
00000000005A147B F34C0F5AC5       cvtss2sd xmm8,xmm5
00000000005A1480 F2410F59C8       mulsd xmm1,xmm8
00000000005A1485 F20F5CC1         subsd xmm0,xmm1
00000000005A1489 F3480F5ACA       cvtss2sd xmm1,xmm2
00000000005A148E F20F58C1         addsd xmm0,xmm1
00000000005A1492 F2480F5AC0       cvtsd2ss xmm0,xmm0

This is the similar code as for double precision, with loads of cvtss2sd & cvtsd2ss instructions thrown in! No mulss, subss or addss in sight, and yes, you can see redundant stuff happening, 4 lines are doing the actual computation, 6 are doing conversions, and doing them every… single… time.

If you’re a fan of “Lucky Luke“, the first two lines may remind you of a Dalton brothers prison break (even though the brothers are in the same cell, they each dig their own hole to freedom) ;-)

Now if you have the  {$EXCESSPRECISION OFF} directive, you see a different picture, the compiler uses single-precision opcodes as expected:

FMandelTest.pas.194: x := x0 * x0 - y0 * y0 + p;
00000000005A1450 0F28C4           movaps xmm0,xmm4
00000000005A1453 F30F59C4         mulss xmm0,xmm4
00000000005A1457 0F28CD           movaps xmm1,xmm5
00000000005A145A F30F59CD         mulss xmm1,xmm5
00000000005A145E F30F5CC1         subss xmm0,xmm1
00000000005A1462 F30F58C2         addss xmm0,xmm2

As Ville Krumlinde pointed in the comments, VS 2010 C-compiler has the same weird behavior.

I say weird, because if you go to the length of specifying single precision floating point, it’s usually because you mean it, and it’s trivial enough to have an expression be automatically promoted to double-precision by throwing in a Double operand or cast.

This reminds me of the old $STRINGCHECKS directive, which one had to remember to adjust or suffer lower string performance. Hopefully the hand holding will be reversed in the next version, with excess precision being off by default.

News , ,

First look at XE2 floating point performance

September 2nd, 2011

With XE2 now officially out, it’s time for a first look at Delphi XE2 compiler floating point performance (see previous episode).

For a first look I’ll reuse a Mandelbrot benchmark, based on this code Mandelbrot Set in HTML 5 Canvas. What it tests are double-precision floating-point basic operations (add, sub, mult) in a tight loop, there is relatively little in the way of memory accesses (or shouldn’t be, to be more accurate).

You can find the source code see there, it compiles pretty much straight away in XE2 (just comment out  the asm for Win64).

NOTE: when this article was originally posted, I had stumbled upon an XE2 Trial version “trap” (or feature?) which basically deactivated Win64 optimizations as defined through the project options. Kenji Matumoto pointed the issue, and this is an updated article where I used {$O+} in the code to “force” optimizations. The outcome is a *much* prettier picture, I’m happy to say! Reservations from the initial articles are gone, good job Embarcadero!

edit 05/09, after further tests, I’m adding one reservation single-precision floating point doesn’t look so hot. More on the subject there.

Benchmark results

Without further ado, here are the raw figures on my machine for the 480 x 480 case, keep in mind the Delphi versions do NOT use Canvas.Pixels[], but direct memory access in an array:

Execution time in milliseconds, lower is better

Or if you prefer hard figures:

  • Delphi XE2 – 32 bits: 193 ms
  • Delphi XE2 – 64 bits: 67 ms — fastest Delphi
  • Delphi XE: 196 ms
  • FireFox 6: 121 ms
  • Chrome 13: 74 ms
  • (out of competition: XE 32bit hand-made assembly: 57 ms)

So what gives?

  • XE2 32bit compiler still uses the old FPU code, the performance delta with XE is minimal and could just be an alignment issue (pseudo-random, since the compiler doesn’t pro-actively align). Let’s hope the SSE2 codegen will be retrofitted in XE3.
  • XE2 64bit compiler get a nice boost from using SSE2, allowing it to catch up and overtake all JavaScript JITters.
  • Chrome V8 makes a good showing in this benchmark, but loses the crown, native Delphi is back on top!

A peek under the hood

What does the compiler generate for the two following lines?

x := x0 * x0 - y0 * y0 + p;
y := 2 * x0 * y0 + q;

Once you pop up the CPU view, you’ll see:

FMandelTest.pas.193: x := x0 * x0 - y0 * y0 + p;
00000000005A1452 660F28C4         movapd xmm0,xmm4
00000000005A1456 F20F59C4         mulsd xmm0,xmm4
00000000005A145A 660F28CD         movapd xmm1,xmm5
00000000005A145E F20F59CD         mulsd xmm1,xmm5
00000000005A1462 F20F5CC1         subsd xmm0,xmm1
00000000005A1466 F20F58C2         addsd xmm0,xmm2
FMandelTest.pas.194: y := 2 * x0 * y0 + q;
00000000005A146A 660F28CC         movapd xmm1,xmm4
00000000005A146E F20F590DA2000000 mulsd xmm1,qword ptr [rel $000000a2]
00000000005A1476 F20F59CD         mulsd xmm1,xmm5
00000000005A147A F20F58CB         addsd xmm1,xmm3

And further down the code, the compiler makes use of xmm8, so it’s really aware of the 16 xmm registers you have in x86-64, and finally keeps floating poitn value in registers, something the 32bit compilers (both XE & XE2) don’t do.

To what does it lose to the hand-made asm version? Well a handful of minor things:

  • even though it used up to 9 xmm registers, it didn’t use 10th, leaving some memory access
  • with more careful allocation, it could have fit everything in 8 xmm registers, which would have cut unnecessary traffic
  • it zeroes register with a move from memory,  didn’t do constant unification or propagation.

Still those are mostly nitpickings compared to the massive issues of the old FPU code compilation (which, alas XE2 – Win32 still suffers from).

Conclusion

Support for SSE2 in XE2 64bit compiler consists in a significant step ahead for Delphi floating point performance. XE2 32bit is still same old.

If you’re doing heavy floating point maths, XE2 64bit compiler is a simple ticket to much better performance.

Hopefully in Delphi XE3 they will retrofitting the SSE2 codegen into the 32bit compiler, but ad interim it should quell all the critics about “we don’t need no 64bit”, well, if you do any significant floating-point maths, Delphi XE2 64bit is a must!

News , , ,

Optimistic Unicode case-insensitive CompareText

May 4th, 2011

In Unicode Delphi, post-Delphi 2009, there are two ways of making case-insensitive string comparisons, CompareText, which only does case-insensitivity in the ASCII range (non-accentuated characters), and the judiciously misnommed AnsiCompareText, which works on the whole Unicode range by calling into the Windows API.

Alas, AnsiCompareText is slow, very slow, as illustrated by TStringList in this post f.i., not just because Microsoft didn’t do a good job at implementing it, but also because the Unicode Consortium did its very best to scatter case-sensitive characters across the whole set… In addition you have special locale-based collation rules to deal with, which have to be applied independently from Unicode.

However, if you’re dealing with primarily latin-based strings, and looking at case-sensitive matching (rather than ordering), it’s possible to cut down significantly on the complexity of AnsiCompareText by replacing it with an “Optimistic UnicodeCompareText“, which will use an ASCII-based strategy like CompareText does, and fallback to the Windows API function when, and only when a non-ASCII character is encountered.

Of course, it won’t help with Cyrillic or Greek, but t will work well for most western-languages, where accentuated and special characters are infrequent (such as german or french).

You can find such a UnicodeCompareText function in DWScript‘s dwsUtils.pas unit, it’s being used to add support for Unicode identifiers in DWS, without incurring a performance penalty for all the pure ASCII identifiers.

Tips ,

The limitations of Delphi’s “inline”

February 8th, 2011

Sometimes, the most simple-looking code can cause the Delphi compiler to stumble.

I bumped on such a case recently, and simplified it to a bare-bones version that still exhibits the issue:

type
   TFloatRec = record
      private
         Field : Double;
      public
         function RecGet : Double; inline;
   end;

   TMyClass = class
      private
         FRec : TFloatRec;
      public
         function Get : Double; virtual;
   end;

function TFloatRec.Get : Double;
begin
   Result:=Field; // here you could do a computation instead
end;

function TMyClass.Get : Double;
begin
   Result:=FRec.RecGet;
end;

Basically all you have are trivial functions that return the value of a floating-point field.

Given the above, for the TMyClass.Get method, the optimal codegen would look just like

fld qword ptr [eax+8]
ret

Simple enough, eh? Yet here is what the Delphi XE compiler generates:

Unit1.pas.326: begin
0053A794 83C4F0           add esp,-$10
Unit1.pas.327: Result:=FRec.Get;
0053A797 83C008           add eax,$08
0053A79A 8B10             mov edx,[eax]
0053A79C 89542408         mov [esp+$08],edx
0053A7A0 8B5004           mov edx,[eax+$04]
0053A7A3 8954240C         mov [esp+$0c],edx
0053A7A7 8B442408         mov eax,[esp+$08]
0053A7AB 890424           mov [esp],eax
0053A7AE 8B44240C         mov eax,[esp+$0c]
0053A7B2 89442404         mov [esp+$04],eax
Unit1.pas.328: end;
0053A7B6 DD0424           fld qword ptr [esp]
0053A7B9 83C410           add esp,$10
0053A7BC C3               ret

for the less-asm fluent, a direct pseudo-pascal translation of the above would be

var
   p : PDouble;
   temp1, temp2 : Double;
begin
   p:=@FRec.Field;
   temp1:=p^;
   temp2:=temp1;
   Result:=temp2;
end;

And if TMyClass.Get is not virtual, but a static method with “inline”, you get the above with a third temp3” Double (ie. it will perform even worse).

The above trips to temporaries aren’t innocuous, because those temporaries are in the stack, and result in stalls as the CPU pipeline waits for the roundtrips to L1 memory cache to happen. In practice, a single of those stalls can take as much time as half a dozen floating operations.

To get rid of the temporaries, there are two options: you can manually inline everything (the RecGet & the Get) to get rid of the temporaries, of course, that doesn’t sit too well with encapsulation, or with virtual calls for that matter.

Or you can use inline-asm instead, a single instruction of asm being enough, and even with calls betweens the functions, it will be running circles around the Delphi compiler’s “inline” output.

Tips , ,