Mandelbrot Set in HTML 5 Canvas
I’ve made a version of the same code in Delphi XE (source + pre-compiled executable, 331 kB ZIP). On my machine, at the 480×480 resolution, Firefox 4 renders the default view in 124 ms, whereas the “regular” Delphi version, which is limited to the old FPU, takes about 200 ms…
It takes manually SSE-enhanced Delphi code to get back on top, with an 87 ms render time. Sure, it’s quick, non-optimized scalar SSE code and could likely be improved, but the point remains that without asm, Delphi XE’s native compiler trails TraceMonkey in the floating-point department…
So Embarcadero, how is that Delphi 64 version coming? Is it properly SSE-enabled?
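For readers who want a feel for the kind of loop being timed, here is a minimal JavaScript sketch of an escape-time Mandelbrot kernel with a simple timer around it. It is not the code from the post or the ZIP: the view bounds, the iteration cap of 255 and the function names are assumptions chosen for illustration, so the timings it prints are not comparable to the figures above.

```javascript
// Escape-time iteration count for the point c = (cx, cy).
// maxIter is an assumed cap, not a value taken from the post.
function mandelIterations(cx, cy, maxIter) {
  let x = 0, y = 0, i = 0;
  while (i < maxIter) {
    const x2 = x * x, y2 = y * y;
    if (x2 + y2 > 4) break;      // |z| > 2: the point has escaped
    y = 2 * x * y + cy;          // imaginary part of z^2 + c
    x = x2 - y2 + cx;            // real part of z^2 + c
    i++;
  }
  return i;
}

// Computes iteration counts for a size x size grid.
// The view (x from -2 to 1, y from -1.5 to 1.5) is an assumption.
function renderCounts(size, maxIter) {
  const counts = new Uint16Array(size * size);
  const xMin = -2.0, yMin = -1.5, span = 3.0;
  const step = span / size;
  for (let py = 0; py < size; py++) {
    const cy = yMin + py * step;
    for (let px = 0; px < size; px++) {
      counts[py * size + px] = mandelIterations(xMin + px * step, cy, maxIter);
    }
  }
  return counts;
}

const t0 = Date.now();
const counts = renderCounts(480, 255);
console.log(`480x480 computed in ${Date.now() - t0} ms`);
```

All the arithmetic is on doubles, which is why the quality of the floating-point code the compiler or JIT emits dominates the result.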
26 thoughts on “Kudos to the Firefox 4 TraceMonkey team!”
On my work PC, Firefox is even faster than your SSE-enhanced Delphi version (120 ms vs 133 ms).
It should be, because in x64 there are no x87 opcodes, AFAIR.
Can you please enable it for testing with IE9 (remove or redefine IE comment tag)?
x87 is still supported, see there, but it seems that Extended is no longer supported in D64, which would hint at SSE all the way.
Interesting, what CPU do you have? What are your figures for Delphi FPU?
Also, your SSE code could be improved a bit 😉 You only use the single SSE opcode variants – I’m sure one could refine that a bit 😉
Delphi FPU ~165ms. CPU is quite old – Intel Core 2 Duo E4700 @2.60GHz.
And yes, the SSE can surely be improved, but that was only to underline the complexity you have to go for in Delphi to beat what TraceMonkey achieves directly on a dynamic language, while Delphi is held back by a prehistoric floating-point codegen.
Actually I’m really impressed by the JS performance!!! (and also that they really use double prec. data values)
In my case even the SSE implementation is much slower than the JS engine (a Core 2 Duo 2.33GHz).
But it helped in my case that I at least removed the calls to fbitmat.width (which is a virtual one…) and declared the drawpixel method inline…
And you are definitely right, the FPU codegen IS PREHISTORIC and needs to be revamped!
Also I guess the delphi compiler guys should also sneak into the js compiler internals to see how they achieve that 😉
@Eric Yes, the Extended type is officially deprecated and won’t be available in the Delphi 64-bit compiler. See what David I said.
On my old AMD Athlon Thoroughbred 1700+ (no SSE) Delphi beats Firefox
Firefox 4: 2523ms
.. so do not make general assumptions yet.
Does anyone have time to check how Free pascal (FPC) compares?
If I understand correctly FPC can generate SSE instructions using “fpc -O3 -CfSSE2”
I just tested it via Lazarus (FPC 2.5.1) auto-convert tool, after excluding DrawPixel (as LCL’s TBitmap doesn’t provide a ScanLines access), the performance is similar to TraceMonkey on my machine, so that’s better than Delphi indeed.
But to keep things in perspective, a slightly more refined SSE version (still just scalar, not SIMD) than the one in the zip I posted now runs twice as fast as TraceMonkey on my machine, so there is still compiler optimization potential.
Hmm, then I seriously have to consider FPC as an alternative (I’m writing some number crunching stuff in Delphi).
When tinkering with the code of your Delphi program, I noticed a strange phenomenon: the standard non-assembly code ComputeMandelDelphi routine took almost twice as long when pressing the reset button, compared to the first run.
The effect was gone when I changed the floats from Double to Single.
Anyone guess what causes this?
Turns out the stack pointer was aligned differently between the initial run and the reset run: first on an 8-byte multiple, but next on a 4-byte multiple. This causes a bad performance hit when reading and writing doubles. I could ‘fix’ this by inserting a routine that checks the current stack pointer and moves it 4 bytes if needed (google for ‘StackAlloc’).
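The alignment check described above is just modular arithmetic on the address. The sketch below shows that arithmetic in JavaScript for illustration only: JavaScript cannot read or move the real stack pointer, the two addresses are made-up values, and the helper names are hypothetical rather than anything from Delphi or the ‘StackAlloc’ routine mentioned.

```javascript
// Number of bytes by which addr overshoots the previous multiple of alignment.
function misalignment(addr, alignment) {
  return addr % alignment;
}

// Rounds addr down to a multiple of alignment. On x86 the stack grows
// downward, so aligning a stack pointer means subtracting the remainder.
function alignDown(addr, alignment) {
  return addr - misalignment(addr, alignment);
}

// Hypothetical stack pointer values, invented for this example.
const firstRun = 0x0012ff78;  // already a multiple of 8
const resetRun = 0x0012ff7c;  // a multiple of 4, but not of 8

console.log(misalignment(firstRun, 8)); // 0
console.log(misalignment(resetRun, 8)); // 4
console.log(alignDown(resetRun, 8).toString(16)); // 12ff78
```

A double occupies 8 bytes, so one stored at an address that is only 4-byte aligned can straddle two aligned 8-byte slots, which fits the slowdown seen with Double but not with Single.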
To sum up: a Delphi developer currently needs a lot of know-how to work around compiler and VCL inefficiencies, while these issues are solved automatically for web developers by the latest browsers. I hope the new Delphi compiler will eliminate this (although Canvas.Pixels is another matter).
To use Delphi-like properties of the Lazarus graphics system, you need to add the unit LCLIntf.
Results on my i7:
– Chrome: 42ms
– Firefox: 49ms
– Opera: 98 ms — my favorite
– Internet Exploiter 9: 439ms — this is insane
Get ready for mobile and web apps. That’s the future.
Sorry for double posting, but I thought I’d let you know my results with the pre-compiled executable in the post:
– normal: 140.6ms
– SSE enhanced: 79.4ms
@Jon Lennart Aasenden
LCLIntf is already in the uses, it doesn’t seem to contain any reference to ‘TBitmap’ anyway.
Your i7 results are a bit surprising: I’ve got an i5, and while my “normal” and “SSE enhanced” timings are similar to yours, the browser scores (apart from IE9) just don’t match. Chrome, for instance, is much slower than Firefox 4 on my i5.
It’s been a while since I played around with graphics in Lazarus. They have 2 types of graphics objects. The first one attempts to be platform independent (which is very slow). So TBitmap and TCanvas under Lazarus won’t be a fair comparison (besides, Firefox will use DIBs exclusively). I seem to remember that TLazIntfImage is the class you want.
@Jon Lennart Aasenden
TLazIntfImage seems to wrap a raw image of some kind, however I couldn’t see an obvious way to specify a pixel depth or to pass it to the LCL image component in an efficient fashion, I guess there are too many layers of abstraction, and I have too little time to unravel them 🙂
BTW. I was running the “normal size” and just tested on my i3 laptop in Opera, the result is 130ms…
Having thought about this for a couple of days (it was shocking at first), the comparison might not be entirely fair.
First of all, pixel plotting using TCanvas/TBitmap is expensive. But compared to all the maths involved here, it’s hardly TCanvas that’s the bottleneck. But you should be able to shave off quite a few milliseconds by going for Graphics32 or any pure DIB class.
The thread should have exclusive access to its graphics DIB, and the main form should pull it out in the OnTerminate event and display it.
I realize of course that this will never make up for SSE and all the other low-level features, but at least it might give us a programmatic environment that is fairer for comparison?
@Jon Lennart Aasenden
If you look at the Delphi code, you’ll see it’s already bypassing TCanvas, and performs direct access, and actually if you comment out the pixel writing, timings don’t change.
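On the JavaScript side, the rough counterpart of bypassing TCanvas is writing straight into a raw RGBA byte buffer instead of issuing one drawing call per pixel. The sketch below assumes the 4-bytes-per-pixel RGBA layout that canvas ImageData uses; the helper name, the tiny buffer size and the colour values are hypothetical, and this is not necessarily how the page in the post does it.

```javascript
// Writes one opaque pixel into an RGBA byte buffer laid out row by row,
// 4 bytes per pixel (the layout used by canvas ImageData.data).
function writePixel(buf, width, x, y, r, g, b) {
  const offset = (y * width + x) * 4;
  buf[offset] = r;
  buf[offset + 1] = g;
  buf[offset + 2] = b;
  buf[offset + 3] = 255; // alpha: fully opaque
}

// A tiny 4x2 buffer stands in for a real canvas here.
const demoWidth = 4, demoHeight = 2;
const pixels = new Uint8ClampedArray(demoWidth * demoHeight * 4);
writePixel(pixels, demoWidth, 3, 1, 10, 20, 30);

console.log(Array.from(pixels.slice(28, 32))); // [ 10, 20, 30, 255 ]
```

In a browser the buffer would typically come from `ctx.createImageData(w, h).data` and be pushed to the screen once per frame with `ctx.putImageData`, which keeps the per-pixel cost down to a few array stores.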
Just a thought 🙂
And I did download the code, btw. What I looked at was the fact that you perform an assignment of the picture to a TImage (or something to that effect; I have a fever at the moment, so I might be mistaken). This should not be part of the computation, since it’s a very slow process in Delphi. But you are right: overall, Delphi is slower here, no doubt about it.