Posts Tagged ‘Delphi’

Pimp your random numbers with XorShift!

December 13th, 2011

A 64bit XorShift is now used to generate random numbers in DWScript, and there is now a separate random number generator per execution, which is auto-randomized when an execution is created.

Previously, the RTL random generator was used. This was “okay” when you had only one script using random numbers at a time, but multiple scripts running at the same time would interfere (f.i. Randomize calls would affect each other), and Random isn’t really thread-safe.

Performance of XorShift is roughly comparable to the Delphi RTL’s linear congruential generator, but with much better statistical random properties and a very long period, without the overhead of a Mersenne Twister. For those interested in the mathematical details, see the “XorShift RNGs” paper by G. Marsaglia.

As an illustration of the improved random properties, consider filling a bitmap with “random” RGB colors for each pixel:

var x, y : Integer;
for x := 0 to bmp.Width-1 do
   for y := 0 to bmp.Height-1 do
      bmp.Pixel[x, y] := RandomInt($1000000);

Using the Delphi built-in Random, you’ll get something like the image below (generated at 512×512, then halved and downgraded to 4bpp for web consumption)

Delphi RTL Random

Oooh… the horizontal scratch lines! Not so random after all… I don’t know if the Delphi LCG is as biased as RANDU, but visibly, it is probably not something you want to rely upon too much.

And now, the same but with the XorShift implementation now used in DWS:

DWScript XorShift Random

The XorShift implementation is very simple, fast, and doesn’t require much memory: a single 64bit value is enough to get good random numbers; use two if you want longer periods that won’t have a chance to loop before the universe ends.
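To give an idea of how little code is involved, here is a minimal sketch of a 64bit XorShift step, using the (13, 7, 17) shift triple from Marsaglia's paper. This is not necessarily the triple or the code DWScript uses; the function names and the seed handling are assumptions made for the illustration.

```pascal
program XorShiftSketch;
{$APPTYPE CONSOLE}
{$Q-,R-} // bits shifted out of the 64bit state are meant to be lost

// One XorShift step: the whole generator state is a single UInt64.
// The state must never be zero, or it would stay at zero forever.
function XorShift64Next(var state: UInt64): UInt64;
begin
  state := state xor (state shl 13);
  state := state xor (state shr 7);
  state := state xor (state shl 17);
  Result := state;
end;

// Hypothetical helper: scales the high 32 bits into 0 .. range-1.
function XorShift64Int(var state: UInt64; range: Cardinal): Cardinal;
begin
  Result := Cardinal(((XorShift64Next(state) shr 32) * UInt64(range)) shr 32);
end;

var
  rng: UInt64; // one such variable per execution keeps generators independent
  i: Integer;
begin
  rng := 88172645463325252; // arbitrary non-zero seed
  for i := 1 to 5 do
    Writeln(XorShift64Int(rng, $1000000)); // a "random" RGB color
end.
```

Since the state is an ordinary variable, giving each script execution its own generator only costs 8 bytes, and no locking is needed as long as an execution sticks to its own state.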

Last but not least, 64bit XorShift may be fast in 32bit binaries, but it practically walks on water in 64bit binaries ;-)

News

Don’t publish your .dproj/.groupproj

November 10th, 2011

Just a quick reminder to everyone publishing Delphi projects with source:

Please don’t publish your .dproj & .groupproj, only publish the .dpr & .dpk

The reason? Those files include machine-specific settings, such as paths, DCU/DCP/BPL/EXE output directories, along with your favorite debug & release options, which are likely different from those of your fellow developers.

It’s possible to have them cleaned up, but that’s tedious, and error-prone unless you check their XML content by hand.

Pretty much every single project with a .dproj out there has issues: that’s from major open-source projects to Embarcadero’s own samples. None of them (of you) got all of them cleaned up right.

But even getting the published .dproj right doesn’t matter: .dproj is where compile options are stored, options you’re just bound to change and adjust. When those .dproj are in a project you synchronize with via version control (SVN, Git, etc.), your locally modified .dproj will likely conflict next time you synchronize, sometimes in unintended and not immediately obvious ways.

Hopefully in a future version, Embarcadero will split the .dproj, so that machine-specific settings are in a distinct file from the non-machine specific settings, which would essentially be per-project relative paths to the source files.

Ad interim, .dproj are just a kludge by design.

Tips

Memory Manager Investigations

October 13th, 2011

André Mussche on Google+ investigated the performance of several Memory Managers for Delphi, in single-threaded & multi-threaded situations, with detailed results and charts on performance and memory usage. Great work and interesting findings!

His conclusions (which I share)

For single threaded or low memory profile applications, the default Delphi memory manager (FastMM) is the fastest you can get. If you don’t realloc a lot (strings?), TCmalloc [from Google perftools] is fast too.

For multi threaded apps, it’s not easy to decide what to use. ScaleMM2 is the fastest but not stable. TCmalloc is a good one, but uses a lot of memory. MSVCRT [Microsoft allocator in msvcrt.dll] looks scalable in simple multi-threaded tests, but in extended tests like FastCodeMMChallenge it is disappointing: slower and uses a lot of memory!
JeMalloc (used by the latest FireFox) is disappointing in multi-threaded areas, but uses the same low memory as FastMM: maybe FF can be made faster by using FastMM? :-)

Additionally, Hoard was tested, though it performed “off the charts” (in a literal and bad way).

You can check André’s charts for yourself:

All in all, for single-threaded applications, or when you have few threads or limited thread-based memory management, FastMM is still King of the Hill, and not just of the Delphi Hill, in terms of performance, memory usage and robustness.
Pierre le Riche can be proud of his baby ;-)

As for multi-threaded applications, ScaleMM, once stabilized, could well become the next undisputed King of the Hill, and not just of the Delphi Hill again.

I don’t know if Embarcadero is aware of the technical lead this offers to Delphi. Surely this is worth some marketing buzz, and some support for the MM authors?

News, Tips

Delphi XE2-64bit: bottleneck in trigonometric functions?

September 22nd, 2011

Taylor Series and Angle Reduction

In Delphi XE2 64bit, SSE2 is used to compute the trigonometric functions (cos, sin, etc.), and they are computed through what looks like Taylor series (with double-precision literals being coded in hexadecimal, likely to minimize compiler precision issues).

However, Taylor series only work well for small values, so a large angle value first has to be reduced to the 0 .. 2PI range, which typically involves a form of floating-point Euclidean division or exponent reduction. For typical SSE2 implementations, this means that computing a trigonometric function for a large angle value is slower, as this reduction has to be performed; typically, you’re looking at something like a 25% slowdown tops.

Bottleneck

That said, here come iga2iga2 (in the comments) and Ville Krumlinde, who both noticed a performance issue in Delphi XE2 64bit, especially when compared to other compilers. In XE2 64bit, the reduction is performed through a loop and a fixed-step reduction, which means that the greater the angle value, the slower it gets.

Here are some timings on a sin/cos benchmark (hundreds of thousands of calls):

Angle value   XE2-32    XE2-64
1.0           112 ms    86 ms
100           113 ms    125 ms
1e7           114 ms    3700 ms
1e14          128 ms    7600 ms

Choices, Choices, Choices

But… timing isn’t everything. When computing trigonometry for very large angles, you quickly run into numerical precision issues, and then you basically have three options:

  • just give up, that’s actually what the FPU does in 32bits, f.i. look at the value of Sin(1e22) in Delphi 32bit, it’s… 1e22. Which is obviously not a valid sine value! And you’ve been living with that potential issue for all your 32bit life…
  • spit out something, anything, under the assumption that if the user went for such an angle, it was garbage, so garbage in, garbage out, no one will notice it, you didn’t see me do it… you can’t prove anything anyway!
  • try to be accurate, damn the timings, damn garbage in, damn the torpedoes, full precision ahead! That’s what XE2-64 is doing. I haven’t checked in detail, but the XE2 implementation seems to be based on this approach: “argument reduction, for huge arguments: good to the last bit“, and it gets Sin(1e22) right.

Just try for Sin(1e22) in your favorite environment, the correct value is -0.8522, Delphi XE2 64bit Gets It Right, where other environments may just flash a bunch of random decimals to fool your eyes.

Update: as pointed out by Daniel Bartlett in the comments, the AMD LibM library provides a much faster and similarly accurate implementation of sin/cos and other functions.

So, what gives?

If you’re after raw accuracy, you’ll have to pay for the extra execution cycles to avoid the garbage out. However, chances are, your code doesn’t have anywhere near the numerical accuracy to avoid garbage in, so no matter the precision in the reduction, you’ll still just get garbage out. And if your code was running in 32bit, chances are you had some huge garbage out already, due to the FPU giving up.

If you’re not after accuracy, f.i. if you’re just using sine/cosine for time-based animations, the extra computing precision may bite you, for no benefit, so you’re better off performing the reduction yourself, before calling sin/cos, using whatever low-precision implementation you wish.
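As a rough illustration of performing the reduction yourself, the sketch below brings the angle back into the 0 .. 2PI range in a single step before calling Sin. The ReduceAngle name and the constant are made up for this example, and the reduction is deliberately low-precision: for huge angles the result is only as good as a Double division allows, which is fine for animations and nothing else.

```pascal
program ReduceAngleSketch;
{$APPTYPE CONSOLE}

const
  cTwoPi = 6.283185307179586; // 2*PI rounded to double precision

// Low-precision reduction of an angle into the 0 .. 2*PI range.
// Int() returns the integer part as a floating point value, so there is
// no integer overflow for large angles (unlike Trunc or Floor).
function ReduceAngle(const a: Double): Double;
begin
  Result := a - cTwoPi * Int(a / cTwoPi);
  if Result < 0 then
    Result := Result + cTwoPi; // Int() truncates toward zero for negative angles
end;

var
  t: Double;
begin
  t := 1e7; // f.i. a time-based animation phase that kept growing
  Writeln(Sin(ReduceAngle(t)):0:6);
end.
```

The cost of this reduction does not depend on the magnitude of the angle, so the sin/cos call that follows always sees a small argument.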

In the long run, it might be preferable for Delphi to just adopt the GIGO approach, and keep the high precision implementations for a high precision maths library: in most situations, they won’t avoid GO because of GI, so it might be best to blend with the rest (in benchmarks).

Tips , ,

Happy {$EXCESSPRECISION OFF}!

September 9th, 2011

Just a notice: I’ve updated the XE2 single-precision floating point article after using the (up to now) undocumented {$EXCESSPRECISION OFF} directive, thanks to Allen Bauer for chiming in!

Executive summary: this directive enables the use of single-precision SSE floating point instructions by the compiler, and brings their performance in line with expectations, making the Delphi XE2 64bit compiler the new King of the Delphi Hill.

Edit: now documented here: Floating point precision control (Delphi for x64). That was fast!

Edit 2: an issue in the compiler was found related to non-explicitly typed constants by Ville Krumlinde, see QC #98753, so be aware of potential incorrect code generation with the initial version of the 64 bits compiler.

Edit 3: if you happen to be one of the error insight users, it will complain that the directive doesn’t exist.

Edit 4: the compiler issue was fixed for Delphi XE2 Update1!

Let’s hope that this directive will be fully supported, and in time face the same fate as $STRINGCHECKS did (ie. become another scary story for the long winter nights).

News

XE2 single-precision floating point (partial) disappointment…

September 5th, 2011

In the previous episode, it appeared that the Delphi XE2 64bit compiler was achieving quite good results. However, after further investigations, things may not be so clear-cut. Transcendental maths will be food for another post; the subject of this one is what seems to be an issue with single-precision floating point maths.

edit: it appears there is an undocumented {$EXCESSPRECISION OFF} directive which controls the generation of the conversion opcodes hampering single-precision floating point performance, and the article has been updated. Thanks to Allen Bauer, Andreano Lanusse & Leif Uneus for bringing it to attention!

Single precision

Single precision has a smaller memory footprint and is typically processed faster (especially when using SIMD: f.i. SSE allows processing 4 single-precision floats at the same time, while you can process only 2 double-precision floats at a time with SSE2), so it is often encountered in performance-critical code where precision isn’t essential. One typical such use is for 3D computations, meshes, and geometry.

Most 3D engines out there make heavy use of single-precision floating point (GLScene and thus FireMonkey too), and it’s the primary native float data type expected by most graphics hardware.

Updated benchmark charts

However, the new 64bit compiler doesn’t like single precision floats, while the 32bit compiler likes them, which leads to this interesting chart:

Mandelbrot times (ms), lower is better

                    XE2 – 32 bits   XE2 – 64 bits
Single Precision    115             257 / 66*
Double Precision    193             67

There are two figures in the 64bit single precision case: the high figure is what you see if you just compile with optimizations (yes, you’re seeing this right, single precision floating point math in Delphi 64bit behaves worse than double-precision maths in Delphi 32bits!), and the low figure is what you get if you use the undocumented (up until this article) {$EXCESSPRECISION OFF} directive.

The new XE2 64bit compiler can give you the best, and the worst: using single precision floats can make your 64 bits code almost 4 times slower if you don’t turn off “excess precision”, while it can make 32 bits code 70% faster…

Why, oh why?

The reason? The 64bit compiler doesn’t use scalar single precision opcodes if you don’t have “excess precision” turned off, and converts everything back and forth to double precision. Here is a snippet from the CPU view:

FMandelTest.pas.193: x := x0 * x0 - y0 * y0 + p;
00000000005A1468 F3480F5AC4       cvtss2sd xmm0,xmm4
00000000005A146D F3480F5ACC       cvtss2sd xmm1,xmm4
00000000005A1472 F20F59C1         mulsd xmm0,xmm1
00000000005A1476 F3480F5ACD       cvtss2sd xmm1,xmm5
00000000005A147B F34C0F5AC5       cvtss2sd xmm8,xmm5
00000000005A1480 F2410F59C8       mulsd xmm1,xmm8
00000000005A1485 F20F5CC1         subsd xmm0,xmm1
00000000005A1489 F3480F5ACA       cvtss2sd xmm1,xmm2
00000000005A148E F20F58C1         addsd xmm0,xmm1
00000000005A1492 F2480F5AC0       cvtsd2ss xmm0,xmm0

This is similar to the code for double precision, with loads of cvtss2sd & cvtsd2ss instructions thrown in! No mulss, subss or addss in sight, and yes, you can see redundant stuff happening: 4 lines are doing the actual computation, 6 are doing conversions, and doing them every… single… time.

If you’re a fan of “Lucky Luke“, the first two lines may remind you of a Dalton brothers prison break (even though the brothers are in the same cell, they each dig their own hole to freedom) ;-)

Now with the {$EXCESSPRECISION OFF} directive, you see a different picture, and the compiler uses single-precision opcodes as expected:

FMandelTest.pas.194: x := x0 * x0 - y0 * y0 + p;
00000000005A1450 0F28C4           movaps xmm0,xmm4
00000000005A1453 F30F59C4         mulss xmm0,xmm4
00000000005A1457 0F28CD           movaps xmm1,xmm5
00000000005A145A F30F59CD         mulss xmm1,xmm5
00000000005A145E F30F5CC1         subss xmm0,xmm1
00000000005A1462 F30F58C2         addss xmm0,xmm2

As Ville Krumlinde pointed out in the comments, the VS 2010 C compiler has the same weird behavior.

I say weird, because if you go to the length of specifying single precision floating point, it’s usually because you mean it, and it’s trivial enough to have an expression be automatically promoted to double-precision by throwing in a Double operand or cast.
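To make both options concrete, here is a small sketch: the directive is turned off for the whole unit, and one function opts back into double precision by introducing a Double operand. The function names are hypothetical, and the IFDEF guard is an assumption meant to keep the 32bit compiler from seeing a directive that only concerns the 64bit one.

```pascal
program ExcessPrecisionSketch;
{$APPTYPE CONSOLE}

{$IFDEF CPUX64}
  {$EXCESSPRECISION OFF} // only meaningful for the 64bit compiler
{$ENDIF}

// All operands are Single: with excess precision off, the 64bit compiler
// is expected to stay in single precision (mulss/subss/addss).
function StepSingle(x0, y0, p: Single): Single;
begin
  Result := x0 * x0 - y0 * y0 + p;
end;

// Opting back into double precision: a Double operand promotes the
// operations it takes part in, and the result is rounded to Single
// on assignment.
function StepPromoted(x0, y0, p: Single): Single;
var
  xd: Double;
begin
  xd := x0;
  Result := xd * xd - y0 * y0 + p;
end;

begin
  Writeln(StepSingle(0.5, 0.25, 0.1):0:8);
  Writeln(StepPromoted(0.5, 0.25, 0.1):0:8);
end.
```

Checking the CPU view for both functions is the only way to be sure what a given compiler version actually generates.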

This reminds me of the old $STRINGCHECKS directive, which one had to remember to adjust or suffer lower string performance. Hopefully the hand holding will be reversed in the next version, with excess precision being off by default.

News

First look at XE2 floating point performance

September 2nd, 2011

With XE2 now officially out, it’s time for a first look at Delphi XE2 compiler floating point performance (see previous episode).

For a first look I’ll reuse a Mandelbrot benchmark, based on this code: Mandelbrot Set in HTML 5 Canvas. It tests basic double-precision floating-point operations (add, sub, mult) in a tight loop, with relatively little in the way of memory accesses (or there shouldn’t be, to be more accurate).

You can find the source code there; it compiles pretty much straight away in XE2 (just comment out the asm for Win64).
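For readers who haven't seen it, the heart of such a benchmark looks roughly like the sketch below. This is not the benchmark's actual source: the MandelIter name, the escape test and the iteration limit are assumptions, only the two assignments to x and y mirror the lines examined in the CPU view further down.

```pascal
program MandelSketch;
{$APPTYPE CONSOLE}

// Counts iterations for the point (p, q) until it escapes or maxIter is hit.
// Only adds, subs and mults on a handful of Double locals: ideally everything
// stays in registers and nothing touches memory inside the loop.
function MandelIter(const p, q: Double; maxIter: Integer): Integer;
var
  x, y, x0, y0: Double;
begin
  x0 := 0;
  y0 := 0;
  Result := 0;
  while Result < maxIter do begin
    x := x0 * x0 - y0 * y0 + p;
    y := 2 * x0 * y0 + q;
    if x * x + y * y > 4 then
      Break; // the point escaped
    x0 := x;
    y0 := y;
    Inc(Result);
  end;
end;

begin
  Writeln(MandelIter(-0.75, 0.1, 1000));
end.
```

A full benchmark calls this once per pixel and writes the iteration count into an array rather than through Canvas.Pixels[].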

NOTE: when this article was originally posted, I had stumbled upon an XE2 Trial version “trap” (or feature?) which basically deactivated Win64 optimizations as defined through the project options. Kenji Matumoto pointed out the issue, and this is an updated article where I used {$O+} in the code to “force” optimizations. The outcome is a *much* prettier picture, I’m happy to say! Reservations from the initial article are gone, good job Embarcadero!

edit 05/09: after further tests, I’m adding one reservation: single-precision floating point doesn’t look so hot. More on the subject there.

Benchmark results

Without further ado, here are the raw figures on my machine for the 480 x 480 case, keep in mind the Delphi versions do NOT use Canvas.Pixels[], but direct memory access in an array:

Execution time in milliseconds, lower is better

Or if you prefer hard figures:

  • Delphi XE2 – 32 bits: 193 ms
  • Delphi XE2 – 64 bits: 67 ms — fastest Delphi
  • Delphi XE: 196 ms
  • FireFox 6: 121 ms
  • Chrome 13: 74 ms
  • (out of competition: XE 32bit hand-made assembly: 57 ms)

So what gives?

  • XE2 32bit compiler still uses the old FPU code, the performance delta with XE is minimal and could just be an alignment issue (pseudo-random, since the compiler doesn’t pro-actively align). Let’s hope the SSE2 codegen will be retrofitted in XE3.
  • XE2 64bit compiler gets a nice boost from using SSE2, allowing it to catch up with and overtake all JavaScript JITters.
  • Chrome V8 makes a good showing in this benchmark, but loses the crown, native Delphi is back on top!

A peek under the hood

What does the compiler generate for the two following lines?

x := x0 * x0 - y0 * y0 + p;
y := 2 * x0 * y0 + q;

Once you pop up the CPU view, you’ll see:

FMandelTest.pas.193: x := x0 * x0 - y0 * y0 + p;
00000000005A1452 660F28C4         movapd xmm0,xmm4
00000000005A1456 F20F59C4         mulsd xmm0,xmm4
00000000005A145A 660F28CD         movapd xmm1,xmm5
00000000005A145E F20F59CD         mulsd xmm1,xmm5
00000000005A1462 F20F5CC1         subsd xmm0,xmm1
00000000005A1466 F20F58C2         addsd xmm0,xmm2
FMandelTest.pas.194: y := 2 * x0 * y0 + q;
00000000005A146A 660F28CC         movapd xmm1,xmm4
00000000005A146E F20F590DA2000000 mulsd xmm1,qword ptr [rel $000000a2]
00000000005A1476 F20F59CD         mulsd xmm1,xmm5
00000000005A147A F20F58CB         addsd xmm1,xmm3

And further down the code, the compiler makes use of xmm8, so it’s really aware of the 16 xmm registers you have in x86-64, and finally keeps floating point values in registers, something the 32bit compilers (both XE & XE2) don’t do.

Where does it lose to the hand-made asm version? Well, a handful of minor things:

  • even though it used up to 9 xmm registers, it didn’t use a 10th, leaving some memory accesses
  • with more careful allocation, it could have fit everything in 8 xmm registers, which would have cut unnecessary traffic
  • it zeroes registers with a move from memory, and didn’t do constant unification or propagation.

Still, those are mostly nitpicks compared to the massive issues of the old FPU code compilation (which, alas, XE2 – Win32 still suffers from).

Conclusion

Support for SSE2 in the XE2 64bit compiler is a significant step forward for Delphi floating point performance. XE2 32bit is still same old.

If you’re doing heavy floating point maths, XE2 64bit compiler is a simple ticket to much better performance.

Hopefully in Delphi XE3 they will retrofit the SSE2 codegen into the 32bit compiler, but ad interim it should quell all the critics saying “we don’t need no 64bit”: if you do any significant floating-point maths, Delphi XE2 64bit is a must!

News

What innocuous-looking unit tests can uncover…

August 17th, 2011

I’ve recently been adding DWScript snippets to Rosetta Code, using them as unit tests as well.

Quite a few of Rosetta Code’s tasks are mathematical, and I was wondering: how many math tests do you really need?

Well, quite a few! While implementing the Lucas-Lehmer test, it ended up hitting the precision boundary much sooner than it theoretically should have, given that DWScript’s Integer is actually a 64bit integer.

Some investigation in the CPU view later, it turned out that the Delphi compiler did not generate the proper CPU instructions for Sqr() in the case of integers, which DWS was relying upon. Apparently this has been QC’ed many times since Delphi 5, but still exists to this day in Delphi XE. The issue is now worked around in the SVN.
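The exact workaround committed to the SVN isn't reproduced here, but one plausible form is simply to avoid Sqr() on 64bit integers and multiply explicitly. The sketch below does that in a small Lucas-Lehmer test; SqrInt64 and LucasLehmer are names made up for the illustration, and the test is only valid for exponents up to 31, beyond which the square no longer fits in an Int64.

```pascal
program LucasLehmerSketch;
{$APPTYPE CONSOLE}

// Explicit 64bit multiplication instead of Sqr(), so the result does not
// depend on what the compiler generates for Sqr() on integers.
function SqrInt64(const v: Int64): Int64;
begin
  Result := v * v;
end;

// Lucas-Lehmer primality test for the Mersenne number 2^p - 1.
// Assumes p is an odd prime (or 2) and p <= 31.
function LucasLehmer(p: Integer): Boolean;
var
  s, m: Int64;
  i: Integer;
begin
  if p = 2 then begin
    Result := True;
    Exit;
  end;
  m := (Int64(1) shl p) - 1;
  s := 4;
  for i := 3 to p do // p-2 iterations
    s := (SqrInt64(s) - 2) mod m;
  Result := (s = 0);
end;

var
  p: Integer;
begin
  for p := 2 to 31 do
    if LucasLehmer(p) then
      Writeln('2^', p, ' - 1 passes the test');
end.
```

A test like this is exactly the kind that reveals the problem: intermediate values stay small for the first few exponents, and only exceed 32 bits once the exponent grows.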

Fixed for XE2? Let’s hope, there may still be time…

Tips

A Fistful of TMonitors

May 31st, 2011

…or why you can’t hide under the complexity carpet ;-)

As uncovered in previous episodes, one of the keys behind TMonitor performance issues is that it allocates a dynamic block of memory for its locking purposes, and when those blocks end up allocated on the same CPU cache line, the two TMonitors will end up fighting for that cache line, resulting in thread contention and a drastic drop in performance. The technical term for that behavior is false sharing.

A quick fix that can come to mind would be to force the allocation of TMonitor’s blocks early on, so that the blocks don’t end up contiguous, and hope and pray that in more complex situations, this will happen automagically.

Alas, that’s a fragile solution, for instance if you take the code in the link mentioned above, you’ll find it doesn’t work all that well:

  • run the same untouched test on different CPUs with larger cache lines or different cache associativity, and the contention can be back
  • instantiate a different class than TInterfaceList, or subclass it and add a few fields to it, and the contention is back

Why is that?

First, different CPUs have different cache line sizes and associativity, so if you have cache-line size dependent code, you need to ask Windows about it. See for instance “How do I determine the processor’s cache line size?“.

Second, you don’t have control over how contiguous dynamic memory will be. FastMM f.i. is a bucket-based allocator: blocks that fall in the same bucket size will be allocated in sequence. In the previous code, with the empty TInterfaceList, you’ll have (optimistically*) allocated something like:

  • TInterfaceList instance 1
  • TMonitor 1 dynamic data
  • TInterfaceList instance 2
  • TMonitor 2 dynamic data

Which makes both monitors’ dynamic data non-contiguous, and if that’s enough to have both TMonitors’ data end up on different cache lines, the test will fly. But if you don’t have some other dynamic data of the appropriate size, the TMonitors’ data will still be contiguous…

*: in practice, even if the same buckets are involved, there is no guarantee the memory order will be the above, as FastMM recycles buckets, so the exact order can depend on the order in which previously allocated buckets of the same size were freed.

Note that if, in your application’s code, you don’t have any other dynamic data that happens to fall in just the same bucket size as TMonitor’s data, all your TMonitors are likely to be contiguous (and even more so if you tend to allocate stuff first, and then run it, without manually pre-allocating the TMonitors).

In the above code, raw TInterfaceList instances are 24 bytes in size, and happen to fall in the same bucket as TMonitor’s 28 bytes data (the 32 bytes bucket).

With a linear garbage-collected allocator, similar contiguousness issues can appear after a garbage collection’s compaction, even if linear allocation was used initially and separated the blocks.

An interesting weakness can also be exposed: a TMonitor’s data (inherently shared) can end up sitting in the middle of thread-specific dynamic data, resulting in another form of false sharing. In that case, TMonitor will not fight with another TMonitor for the cache line, but with your own code and dynamic data.

Why is TRTLCriticalSection not as vulnerable?

After all TRTLCriticalSection is only 24 bytes in size, and thus, smaller than a cache line?

Well, it benefits from being a record, and thus is usually not dynamically allocated on its own, but as part of a larger structure/object, which reduces the risk of two of them sharing a cache line (though if you’re not careful, you can easily end up with false sharing with the owner object’s other fields f.i.).

Note that TCriticalSection dynamically allocates the space for a TRTLCriticalSection, and thus can partially exhibit the false sharing issues that can plague TMonitor’s dynamic data.

Conclusions

The only way to be safe from false sharing, is to allocate large enough blocks, so that you guarantee they use a distinct cache line. In TMonitor’s case, the fix would be to allocate a larger block, rather than a small 28 bytes block as is currently the case.

Ideally, TCriticalSection instances should also be made larger, so their only drawback compared to TRTLCriticalSection would be the (rather negligible) virtual call overhead.
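As a sketch of what "large enough" can mean in practice, the class below surrounds a TRTLCriticalSection with padding on both sides, so that no neighbouring allocation can land on the cache line(s) holding the lock. TPaddedLock and cCacheLine are hypothetical names, the 64 bytes line size is an assumption (query Windows if you need the actual value), and the unit name is Winapi.Windows if unit scope names aren't configured.

```pascal
program PaddedLockSketch;
{$APPTYPE CONSOLE}

uses
  Windows;

const
  cCacheLine = 64; // assumed cache line size in bytes

type
  // The padding fields are never accessed (expect "unused field" hints),
  // they only keep other heap blocks away from FLock's cache line(s).
  TPaddedLock = class
  private
    FPadBefore: array [0..cCacheLine - 1] of Byte;
    FLock: TRTLCriticalSection;
    FPadAfter: array [0..cCacheLine - 1] of Byte;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Enter;
    procedure Leave;
  end;

constructor TPaddedLock.Create;
begin
  inherited Create;
  InitializeCriticalSection(FLock);
end;

destructor TPaddedLock.Destroy;
begin
  DeleteCriticalSection(FLock);
  inherited;
end;

procedure TPaddedLock.Enter;
begin
  EnterCriticalSection(FLock);
end;

procedure TPaddedLock.Leave;
begin
  LeaveCriticalSection(FLock);
end;

var
  lock: TPaddedLock;
begin
  lock := TPaddedLock.Create;
  try
    lock.Enter;
    try
      Writeln('inside the lock');
    finally
      lock.Leave;
    end;
  finally
    lock.Free;
  end;
end.
```

The price is a bit over 128 bytes of padding per lock, which only matters if you have a very large number of them; the static Enter/Leave methods also avoid the virtual call overhead.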

Multi-threading is hard, when you spot a simple problem in a simple test, don’t try to hide it under the complexity carpet, fix it while it’s still simple ;-)

Tips

Once upon a time in a thread…

May 26th, 2011

Last episode in the TMonitor saga. In the previous episode, Chris Rolliston posted a more complete test case, for which he got surprising results (including that a Critical Section approach wouldn’t scale with the thread count). Starting from his code I initially also got similar surprising results.

edit: apparently the “crash” part of the TMonitor issues have been acknowledged by the powers that be, and a hotfix could be on the way, though it points back to QC 78415, an issue reported in 2009, ouch. Guess those 4 bytes per instance haven’t seen much use…

Revised Test with Stable Results

I simplified his code (see below) by dropping the usage of several RTL classes and features, and went for a straightforward implementation. In the process, the oddities went away as far as Critical Section is concerned, and partially so as far as TMonitor goes…
The results can be summarized by this chart:

This was measured on a quad-core. As you can see, the Critical Section version stays flat until the number of threads gets greater than the core count, at which point there is a small ramp arising from the workload taking its toll. TMonitor is a different story: while the revised test doesn’t exhibit the poor scaling I was finding in my previous test, there is still a ramp, as well as a wild jump once there are more threads than cores.

Which RTL class, or what exactly, was the source of the behavior in Chris’s original code, I don’t know. One possible cause, pointed out by Krystian in a former comment, is that instances can end up in the same cache line; though that doesn’t explain everything, it could be a major factor.

Note that TMonitor allocates its own small block for its locking purposes, distinct from the object instance, and AFAICT there are no provisions in case those blocks end up in the same cache line. Though I’m not convinced yet that’s the issue we’re seeing here, this could be a source of contention.

edit: Krystian posted some sample code with cache-line collision avoidance, with it TMonitor becomes much more linear, though half as fast as a CS, and there are still occasional spurious slowdowns showing up in the timings.

Test Code

Here is the test code used for the above. If you test on your machine, make sure you have selected the high performance profile in Windows Power options, and that you don’t have any implicit affinity settings kicking in on the executable.

You can call the code below from a form where you’ll have dropped a TMemo to use as log, as I’m assuming you don’t want to slum it in a command line executable ;-)

const
   cCountdownFrom = $FFFFFF; //increase if necessary...
   cMaxThreads = 10;

type
   TTestThread = class(TThread);

   TTestThreadClass = class of TTestThread;

   TCriticalSectionThread = class(TTestThread)
      FCriticalSection: TRTLCriticalSection;
      procedure Execute; override;
   end;

   TMonitorThread = class(TTestThread)
      procedure Execute; override;
   end;

procedure RunTest(log : TStrings; const testName : String; threadCount : Integer;
                  threadClass : TTestThreadClass);
var
   i : Integer;
   threads : array of TThread;
   tstop, tstart, freq : Int64;
begin
   SetLength(threads, threadCount);

   for i:=0 to threadCount-1 do
      threads[i]:=threadClass.Create(True);

   QueryPerformanceCounter(tstart);

   for i:=0 to threadCount-1 do
      threads[i].Start;
   for i:=0 to threadCount-1 do
      threads[i].WaitFor;

   QueryPerformanceCounter(tstop);
   QueryPerformanceFrequency(freq);

   log.Add(Format('%s: %d thread(s) took %.1f ms',
                  [testName, threadCount, (tstop-tstart)*1000/freq]));

   for i:=0 to threadCount-1 do
      threads[i].Free;
end;

procedure TCriticalSectionThread.Execute;
var
   counter : Integer;
begin
   InitializeCriticalSection(FCriticalSection);

   counter:=cCountdownFrom;
   repeat
      EnterCriticalSection(FCriticalSection);
      try
         Dec(counter);
      finally
         LeaveCriticalSection(FCriticalSection);
      end;
   until counter<=0;

   DeleteCriticalSection(FCriticalSection);
end;

procedure TMonitorThread.Execute;
var
   counter : Integer;
begin
   counter:=cCountdownFrom;
   repeat
      System.TMonitor.Enter(Self);
      try
         Dec(counter);
      finally
         System.TMonitor.Exit(Self);
      end;
   until counter<=0;
end;

procedure RevisedChrisTest(log : TStrings);
var
   i, j : Integer;
begin
   for i:=1 to 3 do begin
      log.Add('*** ROUND '+IntToStr(i)+' ***');
      for j:=1 to cMaxThreads do begin
         RunTest(log, 'TCriticalSection', j, TCriticalSectionThread);
         RunTest(log, 'TMonitor', j, TMonitorThread);
      end;
   end;
end;

Tips