This new episode of the TMonitor saga [1] compares its locking features against TRTLCriticalSection, aka the OS-supported critical section mechanism. Is it fair? Is it fast?
Fairness means that if you have several threads attempting to lock the same resource, they’ll all get a chance and none of them gets the lion’s share. In other words, no thread gets starved of the lock [2].
Speed is more obvious: it’s about the overhead the locking mechanism itself incurs.
Edit 2013-08-24: looks like the issue discussed here should be resolved in XE5, cf Monitoring the Monitor [3] by Allen Bauer.
Test case
The test case presented here aims to not just stress the locking mechanism, but also measure its fairness.
A simple way to do that is to start multiple threads, and then count, over a given time slice, how many times each thread could acquire the lock. Note that in this particular test case, the issues uncovered in previous episodes don’t apply, since there is a single lock.
For that we can use the following structure
type
   TData = array [0..SAMPLES-1, 0..NBTHREADS-1] of Integer;
var
   Data : TData;
   Counter : Integer;
And each of the threads will then have the following workload:
procedure TTestThread.Execute;
begin
   while not Terminated do begin
      LockData;
      try
         if Counter >= NB then
            Terminate
         else begin
            Inc(Counter);
            Inc(Data[Counter shr SAMPLES_BITS, N]);
         end;
      finally
         UnLockData;
      end;
   end;
end;
So each thread increments a shared counter, and records the number of times it could acquire the lock within each range of 2^SAMPLES_BITS counter values.
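The LockData/UnLockData helpers aren’t shown above; a minimal sketch of the two variants being compared could look like the following (the names and initialization are assumptions for illustration, not the benchmark’s actual code):

```pascal
var
   CS : TRTLCriticalSection;  // OS critical section variant
   Obj : TObject;             // TMonitor locks on a plain object instance

procedure InitLocks;
begin
   InitializeCriticalSection(CS);
   Obj := TObject.Create;
end;

// critical section variant
procedure LockData;
begin
   EnterCriticalSection(CS);
end;

procedure UnLockData;
begin
   LeaveCriticalSection(CS);
end;

// TMonitor variant
procedure LockDataMonitor;
begin
   System.TMonitor.Enter(Obj);
end;

procedure UnLockDataMonitor;
begin
   System.TMonitor.Exit(Obj);
end;
```

Swapping one pair of helpers for the other is the only difference between the two test runs, so any timing or fairness difference is attributable to the locking mechanism alone.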
The resulting array can then easily be visualized.
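For instance, the array can be dumped to a CSV file with one row per sample range and one column per thread, ready for charting in a spreadsheet (a hedged sketch; the DumpDataToCSV name and use of TStringList are assumptions):

```pascal
procedure DumpDataToCSV(const fileName : String);
var
   sample, thread : Integer;
   sl : TStringList;
   line : String;
begin
   sl := TStringList.Create;
   try
      for sample := 0 to SAMPLES-1 do begin
         // one row per sample range, acquisitions per thread as columns
         line := IntToStr(Data[sample, 0]);
         for thread := 1 to NBTHREADS-1 do
            line := line + ',' + IntToStr(Data[sample, thread]);
         sl.Add(line);
      end;
      sl.SaveToFile(fileName);
   finally
      sl.Free;
   end;
end;
```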
Outcome
Running the test case with four threads on a (true) quad-core CPU, here is the raw resulting array:
(Figure: critical sections on top, TMonitor on bottom.)
What immediately jumps out is that TMonitor is fair, very fair, while critical sections don’t look quite so fair.
However, the timings tell a different story… The TMonitor version executes in 1.36 sec, while the critical sections version executes in just 0.1 sec, about 13 times faster. When you scale the graphic in proportion to the time taken, the outcome becomes:
So while critical sections still aren’t perfectly fair, when you put things in perspective, their fairness looks far less problematic. And the performance is just not in the same league.
Keep in mind that with higher locking performance, the duration of locks will be reduced, which in turn means that contention is less likely to occur. Conversely, low locking performance means that contention can become a problem in situations where it just shouldn’t have been one.
As an illustration, in the previous test case:
- Critical Sections: system CPU usage is around 70%, so almost 3 threads are busy at any time, only one is waiting on the lock.
- TMonitor: system CPU usage is 25%, meaning that everything is serialized, and only one thread is active at any time.
So even if the code was very short and very contention-prone, and some of the CPU usage could be spinlocks rather than actual work, there is obviously some potential for parallelization.
The test gave similar results for XE and XE2, both in 32bit & 64bit mode. XE3 results are similar but slightly slower (TMonitor being 14.5 times slower instead of 13 times as with XE/XE2). If anyone wants to test against XE4 or on other CPUs, feel free to download the source code [4] and send me the binary, so I can test on the same machine for comparison purposes.
Conclusion
The greater fairness of TMonitor comes not only at an execution overhead cost, but also at a greater contention cost. It’s likely the fairness was actually a side-effect of the greater contention (and effective serialization).
If you’re serious about multi-threading performance, TMonitor should not appear in your code. At all.
Edit: things actually got a little worse with XE3, and XE4 turned out to be identical to XE3 (cf. comments, thanks Chris).