…or why you can’t hide under the complexity carpet 😉
As uncovered in previous episodes, one of the keys behind TMonitor’s performance issues is that it allocates a dynamic block of memory for its locking purposes. When two of those blocks end up on the same CPU cache line, the two TMonitors will fight for that cache line, resulting in thread contention and a drastic drop in performance. The technical term for that behavior is false sharing.
A quick fix that can come to mind would be to force the allocation of TMonitor’s blocks early on, so that the blocks don’t end up contiguous, and hope and pray that in more complex situations, this will happen automagically.
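As a rough illustration of that quick fix (a minimal sketch, not the test code from the earlier episodes): entering and leaving the monitor once, right after creating each object, forces its lock data to be allocated at that point. The program and variable names are hypothetical.

```pascal
program PreallocateMonitors;
{$APPTYPE CONSOLE}

uses
  System.Classes;

var
  ListA, ListB: TInterfaceList; // hypothetical names, for illustration only

begin
  ListA := TInterfaceList.Create;
  // Entering the monitor once forces its lock data to be allocated now,
  // at this point of the allocation sequence, rather than at first use.
  TMonitor.Enter(ListA);
  TMonitor.Exit(ListA);

  ListB := TInterfaceList.Create;
  TMonitor.Enter(ListB);
  TMonitor.Exit(ListB);

  // ... start the worker threads that lock ListA and ListB here ...

  ListB.Free;
  ListA.Free;
end.
```

Whether the two blocks of lock data actually end up far enough apart depends entirely on what else the memory manager hands out in between.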
Alas, that’s a fragile solution. For instance, if you take the code in the link mentioned above, you’ll find it doesn’t work all that well:
- run the same untouched test on different CPUs with larger cache lines or different cache associativity, and the contention can be back
- instantiate a different class than TInterfaceList, or subclass it and add a few fields to it, and the contention is back
Why is that?
First, different CPUs have different cache line sizes and associativity, so if your code depends on the cache line size, you need to ask Windows about it. See for instance “How do I determine the processor’s cache line size?”.
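One way to ask Windows is the GetLogicalProcessorInformation API. The sketch below assumes your Delphi version’s Winapi.Windows unit declares that function along with TSystemLogicalProcessorInformation and RelationCache (recent versions should). The function name GetCacheLineSize and the 64-byte fallback are my own assumptions.

```pascal
program CacheLineSizeDemo;
{$APPTYPE CONSOLE}

uses
  Winapi.Windows;

// Hypothetical helper: returns the line size of the first L1 cache that
// Windows reports, or an assumed fallback of 64 bytes if the query fails.
function GetCacheLineSize: Integer;
var
  Len: DWORD;
  Count, i: Integer;
  Buffer: array of TSystemLogicalProcessorInformation;
begin
  Result := 64; // assumed fallback, not a value reported by the system
  Len := 0;
  // The first call is expected to fail and report the required size in Len.
  GetLogicalProcessorInformation(nil, Len);
  if GetLastError <> ERROR_INSUFFICIENT_BUFFER then
    Exit;
  Count := Len div SizeOf(TSystemLogicalProcessorInformation);
  if Count = 0 then
    Exit;
  SetLength(Buffer, Count);
  if not GetLogicalProcessorInformation(@Buffer[0], Len) then
    Exit;
  for i := 0 to High(Buffer) do
    if (Buffer[i].Relationship = RelationCache) and
       (Buffer[i].Cache.Level = 1) then
    begin
      Result := Buffer[i].Cache.LineSize;
      Break;
    end;
end;

begin
  Writeln('L1 cache line size: ', GetCacheLineSize, ' bytes');
end.
```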
Second, you don’t have control over how contiguous dynamic memory will be. FastMM, for instance, is a bucket-based allocator: blocks that fall in the same bucket size will be allocated in sequence. In the previous code, with the empty TInterfaceList, you’ll have (optimistically*) allocated something like:
- TInterfaceList instance 1
- TMonitor 1 dynamic data
- TInterfaceList instance 2
- TMonitor 2 dynamic data
This makes both monitors’ dynamic data non-contiguous, and if that’s enough for the data of both TMonitors to end up on different cache lines, the test will fly. But if you don’t have some other dynamic data of the appropriate size, the TMonitors’ data will still be contiguous…
*: in practice, even if the same buckets are involved, there is no guarantee the memory order will be the above, as FastMM recycles buckets, so the exact order can depend on the order in which previously allocated buckets of the same size were freed.
Note that if in your application’s code, you don’t have any other dynamic data that happens to fall in just the same bucket size as TMonitor’s data, all your TMonitor are likely to be contiguous (and even more so if you tend to allocate stuff first, and then run it, without manually pre-allocating the TMonitors).
In the above code, raw TInterfaceList instances are 24 bytes in size, and happen to fall in the same bucket (the 32-byte bucket) as TMonitor’s 28 bytes of data.
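To see what the memory manager does on a given machine, you can check whether two small blocks touch a common cache line. This is only a sketch: two 28-byte GetMem blocks stand in for the dynamic data of two TMonitors, the 64-byte line size is an assumption, and SharesCacheLine is a hypothetical helper.

```pascal
program SameCacheLineCheck;
{$APPTYPE CONSOLE}

const
  CacheLineSize = 64; // assumption: query the real value at run time
  BlockSize = 28;     // size of TMonitor's dynamic data, as given in the text

// Hypothetical helper: True when two blocks of Size bytes starting at
// A and B touch at least one common cache line.
function SharesCacheLine(A, B: Pointer; Size: NativeUInt): Boolean;
var
  FirstA, LastA, FirstB, LastB: NativeUInt;
begin
  FirstA := NativeUInt(A) div CacheLineSize;
  LastA := (NativeUInt(A) + Size - 1) div CacheLineSize;
  FirstB := NativeUInt(B) div CacheLineSize;
  LastB := (NativeUInt(B) + Size - 1) div CacheLineSize;
  // The two ranges of cache line indexes overlap.
  Result := (FirstA <= LastB) and (FirstB <= LastA);
end;

var
  P1, P2: Pointer;
begin
  GetMem(P1, BlockSize);
  GetMem(P2, BlockSize);
  try
    Writeln('Block 1 at ', NativeUInt(P1), ', block 2 at ', NativeUInt(P2));
    if SharesCacheLine(P1, P2, BlockSize) then
      Writeln('Shared cache line: false sharing is possible')
    else
      Writeln('Distinct cache lines');
  finally
    FreeMem(P2);
    FreeMem(P1);
  end;
end.
```

The answer can change from run to run and from one memory manager to another, which is rather the point.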
With a linear garbage-collected allocator, similar contiguousness issues can appear after a garbage collection’s compaction, even if linear allocation was used initially and separated the blocks.
An interesting weakness can also be exposed: a TMonitor’s data (inherently shared) can end up sitting in the middle of thread-specific dynamic data, resulting in another form of false sharing. In that case, TMonitor will not fight with another TMonitor for the cache line, but with your own code and dynamic data.
Why is TRTLCriticalSection not as vulnerable?
After all, TRTLCriticalSection is only 24 bytes in size, and thus smaller than a cache line?
Well, it benefits from being a record, and thus is usually not dynamically allocated on its own, but as part of a larger structure or object, which reduces the risk of two critical sections ending up on the same cache line (though if you’re not careful, you can easily end up with false sharing with the owner object’s other fields, for instance).
Note that TCriticalSection dynamically allocates the space for a TRTLCriticalSection, and thus can partially exhibit the false sharing issues that can plague TMonitor’s dynamic data.
The only way to be safe from false sharing is to allocate large enough blocks, so that you guarantee they use a distinct cache line. In TMonitor’s case, the fix would be to allocate a larger block, rather than the small 28-byte block that is currently used.
Ideally, TCriticalSection instances should also be made larger, so their only drawback compared to TRTLCriticalSection would be the (rather negligible) virtual call overhead.
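A minimal sketch of what such a larger block could look like for a critical section wrapper, assuming cache lines of at most 64 bytes. TPaddedCriticalSection is a hypothetical class, not something the RTL provides. With a full line of padding on each side, any cache line that holds part of the lock can only hold bytes of that same instance.

```pascal
program PaddedLockDemo;
{$APPTYPE CONSOLE}

uses
  Winapi.Windows;

const
  AssumedCacheLineSize = 64; // assumption, not queried from the CPU

type
  // Hypothetical class: the padding fields are never read or written,
  // they only keep other data away from the cache lines used by FLock.
  TPaddedCriticalSection = class
  protected
    FPadBefore: array[0..AssumedCacheLineSize - 1] of Byte;
    FLock: TRTLCriticalSection;
    FPadAfter: array[0..AssumedCacheLineSize - 1] of Byte;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Enter;
    procedure Leave;
  end;

constructor TPaddedCriticalSection.Create;
begin
  inherited Create;
  InitializeCriticalSection(FLock);
end;

destructor TPaddedCriticalSection.Destroy;
begin
  DeleteCriticalSection(FLock);
  inherited;
end;

procedure TPaddedCriticalSection.Enter;
begin
  EnterCriticalSection(FLock);
end;

procedure TPaddedCriticalSection.Leave;
begin
  LeaveCriticalSection(FLock);
end;

var
  Lock: TPaddedCriticalSection;
begin
  Lock := TPaddedCriticalSection.Create;
  try
    Lock.Enter;
    try
      Writeln('Instance size: ', Lock.InstanceSize, ' bytes');
    finally
      Lock.Leave;
    end;
  finally
    Lock.Free;
  end;
end.
```

The cost is a bit over a hundred wasted bytes per lock, which is usually cheap compared to the contention it avoids. Enter and Leave are deliberately non-virtual here, so there is no virtual call overhead either.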
Multi-threading is hard: when you spot a simple problem in a simple test, don’t try to hide it under the complexity carpet, fix it while it’s still simple 😉