Last episode in the TMonitor saga. In the previous episode, Chris Rolliston posted a more complete test case, for which he got surprising results (including that a Critical Section approach wouldn’t scale with the thread count). Starting from his code I initially also got similar surprising results.
edit: apparently the “crash” part of the TMonitor issues have been acknowledged by the powers that be, and a hotfix could be on the way, though it points back to QC 78415, an issue reported in 2009, ouch. Guess those 4 bytes per instance haven’t seen much use…
Revised Test with Stable Results
I simplified his code (see below), by dropping the usage of several RTL classes and features, and went for a straightforward implementation, in the process, the oddities went away as far as Critical Section is concerned, and partially so as far as TMonitor goes…
The results can be summarized by this chart:
This was measured on a quad-core, as you can see the Critical Section version stays flat until the number of threads gets greater than the core count, at which point, there is a small ramping arising from the workload taking its toll. TMonitor is a different story, if the revised test doesn’t exhibit the poor scaling I was finding in my previous test, there is still a ramping, as well as a wild jump once there are more threads than cores.
Which RTL class or what exactly was the source of the behavior in Chris’s original code, I don’t know. One possible cause pointed by Krystian in a former comment could be that instances can end up in the same cache line, though that doesn’t explain everything, it could be a start is major factor.
Note that TMonitor allocates its own small block for its locking purposes, distinct from the object instance, and AFAICT there are no provisions in case those blocks end up in the same cache line, though I’m not convinced yet that’s the issue we’re seeing here, this could be a source of contention.
edit: Krystian posted some sample code with cache-line collision avoidance, with it TMonitor becomes much more linear, though half as fast as a CS, and there are still occasional spurious slowdowns showing up in the timings.
Test Code
Here is the test code used for the above, if you test on your machine, make sure you have selected the high performance profile in Windows Power options, and that you don’t have any implicit affinity settings kicking in on the executable.
You can call the above code from a form where you’ll have dropped a TMemo to use as log, as I’m assuming you don’t want to slum it in a command line executable 😉
const
   cCountdownFrom = $FFFFFF; //increase if necessary...
   cMaxThreads = 10;
type
   TTestThread = class(TThread);
   TTestThreadClass = class of TTestThread;
   TCriticalSectionThread = class(TTestThread)
      FCriticalSection: TRTLCriticalSection;
      procedure Execute; override;
   end;
   TMonitorThread = class(TTestThread)
      procedure Execute; override;
   end;
procedure RunTest(log : TStrings; const testName : String; threadCount : Integer;
                  threadClass : TTestThreadClass);
var
   i : Integer;
   threads : array of TThread;
   tstop, tstart, freq : Int64;
begin
   SetLength(threads, threadCount);
   for i:=0 to threadCount-1 do
      threads[i]:=threadClass.Create(True);
   QueryPerformanceCounter(tstart);
   for i:=0 to threadCount-1 do
      threads[i].Start;
   for i:=0 to threadCount-1 do
      threads[i].WaitFor;
   QueryPerformanceCounter(tstop);
   QueryPerformanceFrequency(freq);
   log.Add(Format('%s: %d thread(s) took %.1f ms',
                  [testName, threadCount, (tstop-tstart)*1000/freq]));
   for i:=0 to threadCount-1 do
      threads[i].Free;
end;
procedure TCriticalSectionThread.Execute;
var
   counter : Integer;
begin
   InitializeCriticalSection(FCriticalSection);
   counter:=cCountdownFrom;
   repeat
      EnterCriticalSection(FCriticalSection);
      try
         Dec(counter);
      finally
         LeaveCriticalSection(FCriticalSection);
      end;
   until counter<=0;
   DeleteCriticalSection(FCriticalSection);
end;
procedure TMonitorThread.Execute;
var
   counter : Integer;
begin
   counter:=cCountdownFrom;
   repeat
      System.TMonitor.Enter(Self);
      try
         Dec(counter);
      finally
         System.TMonitor.Exit(Self);
      end;
   until counter<=0;
end;
procedure RevisedChrisTest(log : TStrings);
var
   i, j : Integer;
begin
   for i:=1 to 3 do begin
      log.Add('*** ROUND '+IntToStr(i)+' ***');
      for j:=1 to cMaxThreads do begin
         RunTest(log, 'TCriticalSection', j, TCriticalSectionThread);
         RunTest(log, 'TMonitor', j, TMonitorThread);
      end;
   end;
end;
			



Nice work. Can you please indicate on your chart what you’re plotting? I guessed that it’s milliseconds vs. threads but as now the chart is not really intuitive to grasp for all readers.
@gabr
My GoogleCharts-fu seems to be lacking, I can’t quite find a way to put a title on the axis without messing with the scale, or using up axis space…
The tests use one (internal) synchronization object per thread, so they will never compete for the same critical section or monitor lock, is this intentionally?
@Michael
Yes, the point was to check that no artificial thread contention is introduced by the lock system. The artificial contention arising from critical sections is minimal (probably just the atomic bus locks), while with TMonitor, something else is happening…
@Eric
“… could be that instances can end up in the same cache line, though that doesn’t explain everything, it could be a start.”
Here are modified test and results:
http://code.google.com/p/kblib/downloads/detail?name=TMonitorTests2.zip
As you can see, when that ‘hidden TMonitor record’ is allocated in ‘different processor cache line boundary’ (I’m allocating more objects, and entering to TMonitor to create that record), then TMonitor performs much better.
But still it’s slower than CS, probably because TMonitor do much more calls (like CheckMonitorSupport, GetMonitor, GetFieldAddress, accessing InstanceSize, etc). It looks like TMonitor is 2x slower. When CS is not ‘locked’ it works, almost like single InterlockerCompareExchange (not other CALL inst used).
Also your FCriticalSection is in thread instance, which on my D2009 is 92 bytes, so it doesn’t collide here with ProcessorCacheLineSize.
As you can see at my “Tests on i5-2500K.txt”, when there is 4 threads (=cores on test machine), then TMonitor works almost linear as CS (that i5 processor have a TurboMode, so when more cores then freq is smaller then when works only at 1 core).
If you modify example to make FCriticalSection as a pointer and allocate it in Execute, then you will see *same* slowdown as in TMonitor, like:
TCriticalSectionThread = class(TTestThread)
FCriticalSection: —>PRTLCriticalSection;
…
New(FCriticalSection);
InitializeCriticalSection(FCriticalSection^);
counter:=cCountdownFrom;
repeat
EnterCriticalSection(FCriticalSection^);
try
Dec(counter);
finally
LeaveCriticalSection(FCriticalSection^);
end;
until counter<=0;
DeleteCriticalSection(FCriticalSection^);
Dispose(FCriticalSection);
Here is my perfect resonable explanation 😉
PS. That bug in TMonitor.Wait is other thing…
Also about that allocation in ProcessorCacheLineSize boundary, it’s only ‘problem’, because FastMM optimizes allocations, and often puts same size allocations one by one by 8 bytes alignment (or 16 if Align16Bytes defined), if earlier there was no allocation of this block size, that can be reused, but it’s internals of memory mamanger 😉
If you could force FastMM to allocate memory in 64bytes alignment, then this problem would be not exists (which mostly come in clean test cases), but there would be a problem with higher memory usage.
(My previous comment is awaiting moderations – which contain explanation of this whole issue – so above comment is out of context, for now 😉 )
A chart too! Show off! 😉
Part of the reason I used the RTL classes was because I had in the back of my mind doing a direct performance comparison with C#, a comparison that I indeed ended up doing (perhaps the people who work on adding C#/.NET copycat features might do the same occasionally…?). Also, I prefer using console apps for this sort of thing, and TThread.WaitFor assumes a VCL GUI context… (I could have used the WaitForXXX API though of course.)
What about increasing the work that is done inside the locked blocks?
I did some for i := 0 to $FF do Dec(Counter); instead of the single Dec statement and the results are that the TMonitor version is just a little bit slower than the critical section one. When I profile your sample with sampling profiler it always turns out that InterlockedCompareExchange (>25%) and TMonitor.Exit (~15%) seem to slow everything down. Or did I miss some point again?
@Krystian
Thanks Krystian! So the TMonitor small allocation block is indeed problematic, I guess given that it’s allocated upon first use, the problem is bound to be recurrent, as one will typically first allocate the objects, then lock them at a later time, so the small allocations have a high probability of being together…
The issue isn’t just limited to FastMM, most allocators tend to allocate memory linearly at some point.
@Stefan Glienke
Well, usually you want the code inside a critical section or lock to be as short and minimal as possible, as it’s supposed to operate on a shared resource or data (otherwise, you don’t need a CS in the first place!). If you do some long jobs inside a CS, you’ll just serialize your executions on that shared data, and lose the benefits of multi-threading.
@Chris
Well I have only very rarely had issues with the WinAPI or VCL getting in the way for such tests, so I don’t bother with a console app.
Also, production code rarely ever ends up in a console app here, so if there are issues related to the VCL and a GUI, I want to see them.
@Krystian
With your code avoiding cache line collisions, TMonitor is more linear, but there are still occasional spurious results with TMonitor, that do not exist with the CS, so there must still be another something fishy in the TMonitor code…
@Eric
I understand. My point was just to put some more work into the thread to make it more “realistic” (hope you know what I mean) so its job is not only to lock/unlock all the time (referring to A.Bouchez comment on Chris blog). But yes, that is of course changing the situation. What about putting the loop outside of the locked block?
I totally agree that there is some problem with the TMonitor implementation in Delphi – before Chris blogged about this yesterday I implemented your yesterdays example in C# and was just wondering how fast it was 😉
@Stefan Glienke
Just to be sure on the C# lock, anyone checked what it is actually compiled to? Just to make sure the C# isn’t smart enough to figure out the lock or try/finally is unnecessary, and thus eliminates it.
@Eric
So this still the last episode on TMonitor but probably not the final one…
Oh.. Wheres Mason with his popcorns???
I voted for http://qc.embarcadero.com/wc/qcmain.aspx?d=78415 – it doesn’t hurt and maybe it helps 🙂
Mmm, popcorn… Where was I? Ah yes: if I take Krystian’s version (or a variant thereof) and replace TRTLCriticalSection with TCriticalSection (i.e., the simple class wrapper), TMonitor now beats it. See ‘update 3’ of my original post for a bit more info…