In this new episode of the TMonitor saga is a comparison of its locking features vs TRTLCriticalSection, aka the OS-supported critical section mechanism. Is it fair? is it fast?
Fairness means that if you have several threads attempting to lock the same resources, they’ll all get a chance and none of them gets the lion’s share. In other words, that there are no race conditions.
Speed is obvious, it’s about the overhead the locking mechanism incurs.
Edit 2013-08-24: looks like the issue discussed here should be resolved in XE5, cf Monitoring the Monitor by Allen Bauer.
Test case
The test case presented here aims to not just stress the locking mechanism, but also measure its fairness.
A simple way to do that is to start multiple threads, and then count over a given time slice, how many time each thread could acquire the lock. Not that in this particular test cases, the issue uncovered in previous episodes don’t apply, since there is a single lock.
For that we can use the following structure
type TData = array [0..SAMPLES-1, 0..NBTHREADS-1] of Integer; var Data : TData; Counter : Integer;
And each of the thread will then have the following workload:
procedure TTestThread.Execute; begin while not Terminated do begin LockData; try if Counter>=NB then Terminate else begin Inc(Counter); Inc(Data[Counter shr SAMPLES_BITS, N]); end; finally UnLockData; end; end; end;
So each threads increments a counter, then increments the number of times it could acquire the counter over a given counter range of 2^SAMPLE_BITS samples.
The resulting array can then easily be visualized.
Outcome
Running the test case with four threads on a (true) quad-core CPU, here is the raw resulting array:
Critical sections on top, TMonitor on bottom
What immediately jumps out is that TMonitor is fair, very fair, while critical sections don’t look quite so fair.
However, the timings do tell a different story… TMonitor version executes in 1.36 sec, while critical sections version executed in just 0.1 sec, so 13 times faster… When you scale the graphic in proportion to the time taken, the outcome becomes:
So while critical sections still aren’t as perfectly fair, when you put things in perspective, the fairness of critical sections looks far less problematic. And the performance is just not in the same league.
Keep in mind that with higher locking performance, the duration of locks will be reduced, which in turns means that contention is less likely occur. And low locking performance means that contention can become a problem in situations where it just shouldn’t have been a problem.
As an illustration, in the previous test case:
- Critical Sections: system CPU usage is around 70%, so almost 3 threads are busy at any time, only one is waiting on the lock.
- TMonitor: system CPU usage is 25%, meaning that everything is serialized, and only one thread is active at any time.
So even if the code was very short and very contention-prone, and some of the CPU usage could be spinlocks rather than actual work, there is obviously some potential for parallelization.
The test gave similar results for XE and XE2, both in 32bit & 64bit mode. XE3 results are similar but slightly slower (TMonitor being 14.5 times slower instead of 13 times like with XE/XE2). If any one wants to test against XE4 or on other CPU, feel free to download the source code and send me the binary, so I can test on the same machine for comparison purposes.
Conclusion
The greater fairness of TMonitor comes not only at an execution overhead cost, but also at a greater contention cost. It’s likely the fairness was actually a side-effect of the greater contention (and effective serialization).
If you’re serious about multi-threading performance, TMonitor should not appear in your code. At all.
edit: While nothing is confirmed for XE4 yet, things actually got a little worse with XE3, and XE4 is identical to XE3 (cf. comments, thanks Chris).
Very interesting! Did you ever look into the differences between TCriticalSection and TRTLCriticalSection (if any)? I actually use TFixedCriticalSection (TCriticalSection with the extra 96 bytes).
The results for XE4 are identical to XE3.
thank you for this! //O_O\\
\\ | //
@Mike Versteeg TCriticalSection is a thin wrapper around TRTLCriticalSection, once you avoid the cache collision issue with the extra bytes, the performance should be the same (and only marginally slower if you turn off inlining).
Mind you, the strange thing was that the first time I ran the code (compiled in XE4) I got 0.086 – 4.230 but each subsequent test gets me the results around 0.079 – 10.342. I cant for the life of me account for the times in that first run. @Chris
@Eric
Thanks Eric.
How about my simple locks?
procedure Lock(var Lk: integer);
{$IFDEF CPUX64}
asm
.NOFRAME
mov rdx, rcx
mov ecx, 1
@Loop:
mov eax, 0
lock cmpxchg dword ptr [rdx], ecx
jnz @Loop
end;
{$ELSE CPUX64}
asm
mov edx, eax
mov ecx, 1
@Loop:
mov eax, 0
lock cmpxchg dword ptr [edx], ecx
jnz @Loop
end;
{$ENDIF CPUX64}
procedure UnLock(var Lk: integer);
asm
{$IFDEF CPUX64}
.NOFRAME
lock dec dword ptr[rcx];
{$ELSE CPUX64}
lock dec dword ptr[eax];
{$ENDIF CPUX64}
end;
Imho it’s not surprising at all, that TMonitor is slower than a critical section. Just look at the mass of code that comes with TMonitor – which is due to the fact, that it has more features obviously.
The actual ratio is a bit surprising though …
Given the issues on simple locking, I have my doubts these more advanced features are actually usable or doing what they advertize.
@Eric
These advanced features are certainly usable, the question more what price are you willing to pay? I think that simplicity is “more”, thus would recommend using “simple” things like critical sections where possible …
@Olaf Monien Are they really usable? At least up to XE there are unfixed issues (f.i. http://stackoverflow.com/questions/4856306/tthreadedqueue-not-capable-of-multiple-consumers), and given the other problems, it doesn’t look likely TMonitor ever got stress-tested (unlike OmniThreadLibray f.i.).
@Vitali
It will be efficient, for sure, but it will not release the thread until it accesses the locked semaphore.
I suspect a Windows Critical Section will be more versatile, since in most case it will run as fast as your own code (in most case, Critical Section API will stay in user land and will use only a few asm opcodes), but it will work as expected to release the thread and not burn CPU just for waiting on contention.