André Mussche on Google+ investigated the performance of several Memory Managers for Delphi, in single-threaded & multi-threaded situations, with detailed results and charts on performance and memory usage.
Great work and interesting findings!
- A simple multi-threaded MM speed test (D2010, tested on Windows 7, Intel Quad core).
- More detailed benchmark results of several Delphi memory managers.
His conclusions (which I share):
For single threaded or low memory profile applications, the default Delphi memory manager (FastMM) is the fastest you can get. If you don’t realloc a lot (strings?), TCmalloc [from Google perftools] is fast too.
For multi threaded apps, it’s not easy to decide what to use. ScaleMM2 is the fastest but not stable. TCmalloc is a good one, but uses a lot of memory. MSVCRT [Microsoft allocator in msvcrt.dll] looks scalable in simple multi-threaded tests, but in extended tests like FastCodeMMChallenge it is disappointing: slower and uses a lot of memory!
JeMalloc (used by the latest Firefox) is disappointing in multi-threaded areas, but uses the same low memory as FastMM: maybe FF can be made faster by using FastMM? 🙂
Additionally, Hoard was tested, though it performed “off the charts” (in a literal and bad way).
You can check André’s charts for yourself.
All in all, for single-threaded applications, or when you have few threads or limited thread-based memory management, FastMM is still King of the Hill, and not just of the Delphi Hill, in terms of performance, memory usage and robustness alike.
Pierre le Riche can be proud of his baby 😉
As for multi-threaded applications, ScaleMM, once stabilized, could well become the next undisputed King of the Hill, and not just of the Delphi Hill again.
I don’t know if Embarcadero is aware of the technical lead this offers Delphi. Surely this is worth some marketing buzz, and some support for the MM authors?
IMHO, the best solution is to wait for FastMM5, where Pierre can implement some tricks to make FastMM more scalable in multi-threaded environments. He said that he can start work on FastMM5 before the end of this year.
IMHO Embarcadero cares very little about the MM performance, and it has been so for a long time in Delphi. FastCode project was born exactly to make Delphi faster because its RTL was very poor from a performance point of view.
Most Delphi developers don’t care about the MM, they care about what new widgets are added to the palette and don’t write multi-thread applications. See how they included FastMM into Delphi: just a subset and without many ways to control it (as you can do with the full version).
One note: replacement MMs should take into account the need to share the MM across packages and DLLs. An MM that requires a monolithic executable could be far less useful.
Anyway, what a MM should be is stable. Multithreaded applications with high performance needs are not the ones that can crash at will, usually…
@LDS I agree that speed is more a marketing approach than the main goal for Embarcadero. Writing that XE2 unleashes 64 bit power is true from the accessible RAM point of view, but definitely a joke when it comes to RTL performance (even worse in 64 bit than in 32 bit).
There are other ways of making multi-threaded applications scale. Only changing the MM won’t be enough. See http://blog.synopse.info/tag/SynScaleMM
It is also written that TCmalloc [from Google perftools] does not release the used memory (or releases much less than others) – this may be the main reason why it is fast. 😉 All parameters (including average and long-term memory use) have to be taken into account in such benchmarks. And the only true benchmark is a wall clock on a real application.
The Fastcode benchmark has not just synthetic tests, but also portions derived from replays of actual applications.
But what’s kinda sad, is that from a technical point of view, Delphi has the best Memory Manager for single-threaded applications, and what could be the best for all multi-threaded applications, yet that remains confidential.
A memory manager isn’t everything, of course, but it’s a cornerstone.
@LDS
There are many server-side frameworks for Delphi, like Synopse mORMot/SQLite ORM, RemObjects’ Data Abstract, RTC’s remoting framework, Delphi’s DataSnap, WebServices and IntraWeb, and even databases like NexusDB. All of them use multi-threading extensively.
@A. Bouchez
I agree that the MM is not a silver bullet if one has written an application that doesn’t scale well over multiple threads. Yet the MM is important, especially if the application is memory intensive, and when some memory allocations are beyond the application’s control (strings, for example). IMHO Embarcadero should push the “performance” side more; after all, if one still uses a native compiler without GC and the like, it is because one wants “raw” performance and control. Delivering a so-so solution means weakening the strong advantages of a native compiler in such an area.
@gustavo
No one said Delphi multi-threading doesn’t work, only that it could scale better if the MM were designed to take it into account. Other tools, which were used on machines with a lot of CPUs earlier, faced those issues earlier as well; most Delphi applications have entered that realm only recently. I don’t see Intraweb running deployments as large as Apache’s, just as an example.
When switching from single-threaded to multi-threaded applications, there can be many subtle and not so subtle bottlenecks to take care of. It could also be difficult to design a one-size-fits-all MM for multi-threaded applications, since needs can be very different depending on the thread usage.
@A. Bouchez: “Writing that XE2 unleashes 64 bit power is true from the accessible RAM point of view, but definitely a joke when it comes to RTL performance (even worse in 64 bit than in 32 bit).”
Care to elaborate? The only concrete discussions of Delphi 64 bit performance I’ve come across were from our esteemed host, which didn’t really support such a judgement, once a compiler setting became known.
I’m still looking for another MM that works with Intraweb. I’ve tried ScaleMM, ScaleMM2, SynScaleMM, BigBrain MM, etc, etc. The fact that none of them can handle a simple IW application is very disturbing, and indicates to me that they are not quite there yet…
André Mussche told me to try ScaleMM3, and I’ll give it a try soon.
@Alexandre
Are your tests with different MMs prompted by some specific issue with the Intraweb/FastMM combination or with IW multi-threading in general, or is this only an investigation out of curiosity…?
@ZBU
I’m just trying another MM in an IW application that I have, to see how it works and to check whether an IW app can benefit from another MM when running on a multicore processor.
The disturbing thing is: if you create the simplest IW app possible (File -> New -> VCL for the Web application) and add another MM to it, it will not run AT ALL: multiple exceptions, access violations, whatever. I guess (only a guess!) that the problem is with the memory managers and not with Intraweb, since it works perfectly when FastMM is the memory manager.
Regards
Where can I get this test code?
Thanks
@Chris Just take a look at the RTL files. The supplied memory manager is FastMM4 in pure pascal mode (slower than the x86 asm version), and the most-used RTL functions are also in pure pascal mode, calling external functions such as InterlockedIncrement(). At a higher level, it is clear that no profiling has been done. One simple test is to run “for i := 0 to 100000 do IntToStr(i)” in several threads at once: you’ll find some awful performance issues (e.g. IntToStr using a temporary AnsiString and then converting it to UnicodeString at each call, which has occurred since Delphi 2009). I have not seen any notable speed improvement in the generated code since Delphi 2007 and the introduction of the “inline” keyword – see http://stackoverflow.com/questions/6372017 – the only exception being the SSE2 math introduced in Delphi XE2 64 bit mode. New features (like generics) tend to duplicate the generated code, and may therefore be slower if used.
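For anyone who wants to try the multi-threaded IntToStr test suggested above, here is a minimal sketch of how it could be set up. The program name, the TIntToStrThread class, the thread count and the Repeats constant (added only so that GetTickCount has something measurable to time) are all assumptions made for illustration; they do not come from the RTL or from any published benchmark.

```pascal
program IntToStrThreadsSketch;
{$APPTYPE CONSOLE}

uses
  Windows, Classes, SysUtils;

const
  ThreadCount = 4;   // assumption: adjust to the number of cores
  Repeats     = 50;  // assumption: repeats the loop so the timing is measurable

type
  // Hypothetical worker: runs the loop from the comment above
  TIntToStrThread = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TIntToStrThread.Execute;
var
  r, i: Integer;
  s: string;
begin
  for r := 1 to Repeats do
    for i := 0 to 100000 do
      s := IntToStr(i); // each call allocates a new string through the MM
end;

// Starts Count workers, waits for all of them, returns elapsed milliseconds
function RunThreads(Count: Integer): Cardinal;
var
  Threads: array of TIntToStrThread;
  i: Integer;
  Start: Cardinal;
begin
  SetLength(Threads, Count);
  Start := GetTickCount;
  for i := 0 to Count - 1 do
    Threads[i] := TIntToStrThread.Create(False); // False = start right away
  for i := 0 to Count - 1 do
  begin
    Threads[i].WaitFor;
    Threads[i].Free;
  end;
  Result := GetTickCount - Start;
end;

begin
  Writeln('1 thread : ', RunThreads(1), ' ms');
  Writeln(ThreadCount, ' threads: ', RunThreads(ThreadCount), ' ms');
end.
```

Each thread does the same amount of work, so with perfect scaling (and enough cores) both lines would report roughly the same time. How far the second figure drifts above the first gives a rough idea of the contention, though GetTickCount is coarse and this is no substitute for a wall clock on a real application.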
@Alexandre Machado Did you set the external MM at first position in the “uses” clause of the .dpr?
@A.Bouchez,
Hi Arnaud.
Yes, I set the alternative MM as the first file in the .DPR file. Also, I’ve tried using FastMM first and then SynScaleMM as the second file, trying to circumvent the Andreas Hausladen’s MidasSpeedFix issue that I told you about on the Embarcadero forum some weeks ago (https://forums.embarcadero.com/message.jspa?messageID=322760#322760)
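For readers unfamiliar with the unit-order point: a replacement MM installs itself in its unit’s initialization section, so that unit has to be listed first in the project’s “uses” clause, before any other unit gets a chance to allocate memory. Here is a minimal sketch, assuming the full FastMM4 distribution is on the search path (its unit really is named FastMM4; for another MM you would put its unit there instead). The program name is made up for the example.

```pascal
program MMFirstSketch;
{$APPTYPE CONSOLE}

uses
  FastMM4,   // replacement MM unit: must be the very first unit listed,
             // so it is initialized before anything else allocates memory
  SysUtils;  // every other unit comes after it

begin
  // IsMemoryManagerSet (System unit) reports whether the default
  // memory manager has been overridden via SetMemoryManager.
  if IsMemoryManagerSet then
    Writeln('A replacement memory manager is installed')
  else
    Writeln('Still using the default memory manager');
end.
```

Units such as MidasSpeedFix that patch things during their own initialization can interact with this ordering, which is presumably why the order of the first few units was being experimented with here.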
@Chris
Hi Chris,
The RTL functions for handling strings use PUREPASCAL when compiling for 64-bit.
This slows down most applications.
I submitted http://qc.embarcadero.com/wc/qcmain.aspx?d=100211
I’m sorry but FastMM in XE2 still has the same huge flaw: its bad use of Sleep() only because its programmer refuses to learn how a critical section works and why it’s a lot better. FastMM is a BIG problem when you need quick response, like when processing short timeslices in an audio mixing thread. NeverSleepOnMMThreadContention makes it less bad, but for real, its author has to learn about critical sections, which are basically doing what he’s attempting to do manually with his Sleep method, BUT with a proper spin count and certainly not using Sleep, which is beyond bad, a lot worse than waiting for an event (WaitForSingleObject). Sleeping for 10ms is really crazy, I don’t think the author realizes what 10ms means for an application. Audio bufferswitches are 2ms to 10ms long these days. FastMM is only fast -statistically-.
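To make the alternative concrete, here is a rough sketch of a critical section created with a spin count, protecting a shared counter that several threads hammer on. This is not FastMM’s code and not a proposed patch, only an illustration of the Win32 calls involved. The program name, the TWorker class and the constants (including the spin count of 4000) are assumptions chosen for the example.

```pascal
program SpinCountSketch;
{$APPTYPE CONSOLE}

uses
  Windows, Classes, SysUtils;

const
  ThreadCount = 4;       // assumption
  Iterations  = 100000;  // assumption
  SpinCount   = 4000;    // assumption: illustrative value, tune per workload

var
  Lock: TRTLCriticalSection;
  Counter: Integer = 0;

type
  // Hypothetical worker contending for the lock
  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
var
  i: Integer;
begin
  for i := 1 to Iterations do
  begin
    // On contention the thread first spins up to SpinCount times in user
    // mode (on multi-processor machines) and only then blocks in the
    // kernel; it never sleeps for a fixed number of milliseconds.
    EnterCriticalSection(Lock);
    try
      Inc(Counter);
    finally
      LeaveCriticalSection(Lock);
    end;
  end;
end;

var
  Workers: array[0..ThreadCount - 1] of TWorker;
  i: Integer;
begin
  InitializeCriticalSectionAndSpinCount(Lock, SpinCount);
  try
    for i := Low(Workers) to High(Workers) do
      Workers[i] := TWorker.Create(False); // start immediately
    for i := Low(Workers) to High(Workers) do
    begin
      Workers[i].WaitFor;
      Workers[i].Free;
    end;
    Writeln('Counter = ', Counter);
  finally
    DeleteCriticalSection(Lock);
  end;
end.
```

Since every increment happens under the lock, this should print Counter = 400000. The point of the sketch is only that a waiting thread spins briefly and then blocks until the lock is released, instead of giving up its timeslice for a fixed interval.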
FastMM is still pretty cool for non-threaded apps, though.
IME for audio or time-critical processing, the memory manager is somewhat irrelevant as you are (or should be) dealing with statically allocated memory.
@Eric
You can’t always preallocate everything, only what you can predict. Still, memory allocation should only be a problem if you do it too much; it shouldn’t be a problem if you do it once, twice or even a dozen times during a bufferswitch (multithreaded processing also requires costly waits for events, and thus kernel switches; yet it works, and audio applications are compatible with multithreaded processing). With FastMM, even once is a problem. Normally, the worst that could happen should be a call to the OS to allocate a new chunk, and even this shouldn’t break audio. And in the best case it should only be a local search for a free memory chunk, which could be lock-free (but ok, that’s very hard to do properly, I wouldn’t expect this in Delphi) or locking using a critical section, which is NOT the evil some like to believe: it’s spinning at best (a few cycles), WaitForSingleObject (way <0.1ms) at worst.