TMonitor vs TRTLCriticalSection

This new episode of the TMonitor saga compares its locking features against TRTLCriticalSection, aka the OS-supported critical section mechanism. Is it fair? Is it fast?

Fairness means that if you have several threads attempting to lock the same resource, they’ll all get a chance and none of them gets the lion’s share. In other words, no thread gets starved.

Speed is obvious: it’s about the overhead the locking mechanism incurs.

Edit 2013-08-24: looks like the issue discussed here should be resolved in XE5, cf Monitoring the Monitor by Allen Bauer.

Test case

The test case presented here aims to not just stress the locking mechanism, but also measure its fairness.

A simple way to do that is to start multiple threads, then count, over a given time slice, how many times each thread could acquire the lock. Note that in this particular test case, the issues uncovered in previous episodes don’t apply, since there is a single lock.

For that we can use the following structure:

type
   TData = array [0..SAMPLES-1, 0..NBTHREADS-1] of Integer;

var
   Data : TData;
   Counter : Integer;

And each of the threads will then have the following workload:

procedure TTestThread.Execute;
begin
   while not Terminated do begin
      Lock;  // critical section or TMonitor, depending on the variant tested
      if Counter >= NB then
         Terminate
      else begin
         Inc(Data[Counter shr SAMPLE_BITS, N]);  // N = this thread's index
         Inc(Counter);
      end;
      Unlock;
   end;
end;

So each thread increments the shared counter, and records how many times it could acquire the lock within each successive range of 2^SAMPLE_BITS counter values.
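To illustrate the bucketing (with a hypothetical SAMPLE_BITS value, not the one used in the original benchmark), the `shr` maps consecutive counter ranges to sample indexes:

```pascal
const
   SAMPLE_BITS = 12;  // hypothetical value, for illustration only
begin
   // counter values 0..4095 land in sample 0, 4096..8191 in sample 1, etc.
   Assert(4095 shr SAMPLE_BITS = 0);
   Assert(4096 shr SAMPLE_BITS = 1);
   Assert(8192 shr SAMPLE_BITS = 2);
end.
```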

The resulting array can then easily be visualized.
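For instance, the array can be dumped to a CSV file and charted in a spreadsheet — a minimal sketch (the procedure name is an assumption, not part of the original benchmark code):

```pascal
uses
   SysUtils, Classes;

procedure DumpDataToCSV(const fileName : String);
var
   i, j : Integer;
   line : String;
   sl : TStringList;
begin
   sl := TStringList.Create;
   try
      // one row per sample, one column per thread
      for i := 0 to SAMPLES-1 do begin
         line := IntToStr(Data[i, 0]);
         for j := 1 to NBTHREADS-1 do
            line := line + ',' + IntToStr(Data[i, j]);
         sl.Add(line);
      end;
      sl.SaveToFile(fileName);
   finally
      sl.Free;
   end;
end;
```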


Running the test case with four threads on a (true) quad-core CPU, here is the raw resulting array:


Critical sections on top, TMonitor on bottom

What immediately jumps out is that TMonitor is fair, very fair, while critical sections don’t look quite so fair.

However, the timings do tell a different story… the TMonitor version executes in 1.36 sec, while the critical sections version executes in just 0.1 sec, i.e. about 13 times faster… When you scale the graphic in proportion to the time taken, the outcome becomes:


So while critical sections still aren’t perfectly fair, when you put things in perspective, their fairness looks far less problematic. And the performance is just not in the same league.

Keep in mind that with higher locking performance, the duration of locks will be reduced, which in turn means that contention is less likely to occur. Conversely, low locking performance means that contention can become a problem in situations where it just shouldn’t have been one.

As an illustration, in the previous test case:

  • Critical Sections: system CPU usage is around 70%, so almost 3 threads are busy at any time, only one is waiting on the lock.
  • TMonitor: system CPU usage is 25%, meaning that everything is serialized, and only one thread is active at any time.

So even if the code was very short and very contention-prone, and some of the CPU usage could be spinlocks rather than actual work, there is obviously some potential for parallelization.

The test gave similar results for XE and XE2, both in 32bit & 64bit mode. XE3 results are similar but slightly slower (TMonitor being 14.5 times slower instead of 13 times as with XE/XE2). If anyone wants to test against XE4 or on other CPUs, feel free to download the source code and send me the binary, so I can test on the same machine for comparison purposes.


The greater fairness of TMonitor comes not only at an execution overhead cost, but also at a greater contention cost. It’s likely the fairness was actually a side-effect of the greater contention (and effective serialization).

If you’re serious about multi-threading performance, TMonitor should not appear in your code. At all.

edit: While nothing is confirmed for XE4 yet, things actually got a little worse with XE3, and XE4 is identical to XE3 (cf. comments, thanks Chris).


12 thoughts on “TMonitor vs TRTLCriticalSection”

  1. Very interesting! Did you ever look into the differences between TCriticalSection and TRTLCriticalSection (if any)? I actually use TFixedCriticalSection (TCriticalSection with the extra 96 bytes).

  2. @Mike Versteeg TCriticalSection is a thin wrapper around TRTLCriticalSection, once you avoid the cache collision issue with the extra bytes, the performance should be the same (and only marginally slower if you turn off inlining).
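    For reference, a minimal sketch of such a padded wrapper (the 96-byte figure follows the comment above; the field name is an assumption):

```pascal
uses
   SyncObjs;

type
   // Pads TCriticalSection so that two instances are unlikely to share
   // a CPU cache line, avoiding false sharing between unrelated locks
   TFixedCriticalSection = class(TCriticalSection)
   private
      FPadding : array [0..95] of Byte;
   end;
```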

  3. Mind you, the strange thing was that the first time I ran the code (compiled in XE4) I got 0.086 – 4.230 but each subsequent test gets me results around 0.079 – 10.342. I can’t for the life of me account for the times in that first run. @Chris

  4. How about my simple locks?

    procedure Lock(var Lk: integer);
    asm
    {$IFDEF CPUX64}
       mov rdx, rcx
    @Loop:
       mov ecx, 1
       mov eax, 0
       // atomically set Lk to 1 if it was 0, else spin
       lock cmpxchg dword ptr [rdx], ecx
       jnz @Loop
    {$ELSE}
       mov edx, eax
    @Loop:
       mov ecx, 1
       mov eax, 0
       lock cmpxchg dword ptr [edx], ecx
       jnz @Loop
    {$ENDIF}
    end;

    procedure UnLock(var Lk: integer);
    asm
    {$IFDEF CPUX64}
       lock dec dword ptr [rcx]
    {$ELSE}
       lock dec dword ptr [eax]
    {$ENDIF}
    end;

  5. Imho it’s not surprising at all that TMonitor is slower than a critical section. Just look at the mass of code that comes with TMonitor – which is obviously due to the fact that it has more features.

    The actual ratio is a bit surprising though …

  6. Olaf Monien :

    that it has more features obviously.

    Given the issues on simple locking, I have my doubts these more advanced features are actually usable or doing what they advertise.

  7. @Eric
    These advanced features are certainly usable, the question is more: what price are you willing to pay? I think that simplicity is “more”, thus I would recommend using “simple” things like critical sections where possible …

  8. @Vitali
    It will be efficient, for sure, but it will not release the thread until it accesses the locked semaphore.

    I suspect a Windows Critical Section will be more versatile, since in most cases it will run as fast as your own code (in most cases, the Critical Section API will stay in user land and use only a few asm opcodes), but it will work as expected to release the thread and not burn CPU just waiting on contention.
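    The behaviour described above — a short user-land spin before falling back to a kernel wait — can also be tuned explicitly via the Windows API; a minimal sketch (the spin count of 4000 is an arbitrary illustration):

```pascal
uses
   Winapi.Windows;

var
   CS : TRTLCriticalSection;

procedure InitLock;
begin
   // spin up to ~4000 iterations in user land before waiting in the kernel
   InitializeCriticalSectionAndSpinCount(CS, 4000);
end;

procedure DoProtectedWork;
begin
   EnterCriticalSection(CS);
   try
      // ...protected work here...
   finally
      LeaveCriticalSection(CS);
   end;
end;
```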

Comments are closed.