Kudos to the Firefox 4 TraceMonkey team!

By Eric Grange / March 24, 2011

I’ve been quite impressed with the JavaScript floating-point performance in Firefox 4, which puts the Delphi compiler to shame. See for yourself with this fractal rendering demo:

Mandelbrot Set in HTML 5 Canvas

I’ve made a version of the same code in Delphi XE (source + pre-compiled executable, 331 kB ZIP). On my machine, at the 480×480 resolution, Firefox 4 renders the default view in 124 ms, whereas the “regular” Delphi version, which is limited to the old x87 FPU, takes about 200 ms…

It takes manually SSE-enhanced Delphi code to get back on top, with an 87 ms render time. Sure, it’s quick, non-optimized scalar SSE code and could likely be improved, but the point remains that without asm, Delphi XE’s native compiler trails TraceMonkey in the floating-point department…
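
For readers who want a feel for the workload being timed, here is a minimal sketch of an escape-time Mandelbrot kernel in JavaScript. It is not the code of the linked demo nor a port of the Delphi version: the function names, the view window and the iteration limit are all assumptions chosen for illustration.

```javascript
// Escape-time iteration count for one point c = (cx, cy).
// JavaScript numbers are IEEE 754 doubles, so this runs in double precision.
function mandelIterations(cx, cy, maxIter) {
  let x = 0, y = 0, i = 0;
  while (i < maxIter) {
    const x2 = x * x, y2 = y * y;
    if (x2 + y2 > 4) break;   // |z| > 2: the point escapes
    y = 2 * x * y + cy;       // imaginary part of z*z + c (uses the old x)
    x = x2 - y2 + cx;         // real part of z*z + c
    i++;
  }
  return i;
}

// Compute a size x size grid of iteration counts and time the computation only
// (no pixel output), using an assumed view window.
function renderMandel(size, maxIter) {
  const counts = new Uint16Array(size * size);
  const xMin = -2.0, yMin = -1.5, span = 3.0;   // assumed default view
  const step = span / size;
  const t0 = Date.now();
  for (let py = 0; py < size; py++) {
    const cy = yMin + py * step;
    for (let px = 0; px < size; px++) {
      counts[py * size + px] = mandelIterations(xMin + px * step, cy, maxIter);
    }
  }
  return { counts, ms: Date.now() - t0 };
}

const result = renderMandel(480, 256);
console.log(`480x480 computed in ${result.ms} ms`);
```

The inner loop is nothing but double-precision multiplies and adds, which is why the quality of the floating-point code generation dominates the timings.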

So Embarcadero, how is that Delphi 64 version coming? Is it properly SSE-enabled?

26 thoughts on “Kudos to the Firefox 4 TraceMonkey team!”

Linas says:

March 24, 2011 at 13:56

On my work PC, Firefox is even faster than your SSE-enhanced Delphi version (120 ms vs 133 ms).
A.Bouchez says:

March 24, 2011 at 14:08

It should be, because in x64 there are no x87 opcodes, AFAIR.
Nice post!
Nikola J says:

March 24, 2011 at 14:33

Can you please enable it for testing with IE9 (remove or redefine IE comment tag)?
Eric says:

March 24, 2011 at 14:59

@A.Bouchez
x87 is still supported, see there, but it seems that Extended is no longer supported in D64, which would hint at SSE all the way.

@Nikola J
The JavaScript sample is not mine.

@Linas
Interesting, what CPU do you have? What are your figures for Delphi FPU?
Mike says:

March 24, 2011 at 15:04

Change from double to single and change the references to fbitmap.width to an object member. My guess is that the JavaScript compiler does exactly that… but I have to admit it’s very impressive!

Also, your SSE code could be improved a bit 😉 you only use the single SSE opcode variants – I’m sure one could refine that a bit 😉
Linas says:

March 24, 2011 at 15:19

@Eric
Delphi FPU ~165ms. CPU is quite old – Intel Core 2 Duo E4700 @2.60GHz.
Eric says:

March 24, 2011 at 15:41

@Mike
No, the JavaScript version is double-precision, that’s confirmed by the zooming and quality it achieves. The Bitmap.Width is nowhere near the critical path, and it’s implemented rather trivially to begin with (it’s not a call to the WinAPI).

And yes, the SSE can surely be improved, but that was only to underline the lengths you have to go to in Delphi to beat what TraceMonkey achieves directly on a dynamic language, while Delphi is held back by a prehistoric floating-point codegen.
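
To illustrate why deep zooming gives away the precision in use, here is a small sketch under assumed numbers (the center and the pixel spacing are hypothetical, not taken from the demo). Once the spacing between pixels drops below what single precision can resolve, neighbouring columns collapse onto the same coordinate and the image turns blocky, while doubles still tell them apart.

```javascript
// Hypothetical deep-zoom view: 480 columns spaced 1e-9 apart near x = -0.75.
const center = -0.75;
const step = 1e-9;
const columns = 480;

const asDouble = new Set();
const asSingle = new Set();
for (let px = 0; px < columns; px++) {
  const cx = center + px * step;
  asDouble.add(cx);               // plain JavaScript number: double precision
  asSingle.add(Math.fround(cx));  // rounded to the nearest single-precision value
}

console.log(`distinct column coordinates in double: ${asDouble.size}`);
console.log(`distinct column coordinates in single: ${asSingle.size}`);

// Two adjacent pixels: distinguishable as doubles, identical as singles.
const a = center, b = center + step;
console.log(a === b);                            // false
console.log(Math.fround(a) === Math.fround(b));  // true
```

`Math.fround` is used here only to emulate single-precision rounding; it says nothing about how any particular engine represents numbers internally.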
Mike says:

March 24, 2011 at 16:08

@Eric:

Actually I’m really impressed by the JS performance!!! (and also that they really use double-precision data values)

In my case even the SSE implementation is much slower than the JS engine
(a Core 2 Duo 2.33GHz).

But it helped in my case that I at least removed the calls to fbitmap.width (which is a virtual one…) and declared the drawpixel method inline…

And you are definitely right, the FPU codegen IS PREHISTORIC and needs to be revamped!

Also, I guess the Delphi compiler guys should sneak a look at the JS compiler internals to see how they achieve that 😉
A.Bouchez says:

March 24, 2011 at 18:19

@Eric Yes, the Extended type is officially deprecated and won’t be available in the Delphi 64-bit compiler. See what David I said.
dwrbudr says:

March 24, 2011 at 21:56

On my old AMD Athlon Thoroughbred 1700+ (no SSE) Delphi beats Firefox
Delphi: 665ms
Firefox 4: 2523ms
.. so do not make general assumptions yet.
Ajasja Ljubetič says:

March 25, 2011 at 12:11

Does anyone have time to check how Free Pascal (FPC) compares?
If I understand correctly, FPC can generate SSE instructions using “fpc -O3 -CfSSE2”.

Best regards,
Ajasja
Eric says:

March 25, 2011 at 14:50

@Ajasja Ljubetič
I just tested it via the Lazarus (FPC 2.5.1) auto-convert tool. After excluding DrawPixel (as LCL’s TBitmap doesn’t provide ScanLines access), the performance is similar to TraceMonkey on my machine, so that’s indeed better than Delphi.

But to keep things in perspective, a slightly more refined SSE version (still just scalar, not SIMD) than the one in the zip I posted now runs twice as fast as TraceMonkey on my machine, so there is still compiler optimization potential.
Ajasja Ljubetič says:

March 26, 2011 at 10:36

Hmm, then I seriously have to consider FPC as an alternative (I’m writing
some number crunching stuff in Delphi).
Thank you!
Victor says:

March 26, 2011 at 16:26

You didn’t even mention the other big drawback of the Delphi version compared to the JavaScript/HTML 5 code: you had to use a ‘special trick’ in DrawPixel, using cached scanlines, to make the pixel drawing fast enough. Many Delphi developers will be unaware of the performance benefits over using Canvas.Pixels.
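
For illustration, the cached-scanline idea can be sketched in JavaScript as follows. This is not the Delphi DrawPixel from the zip: the surface object and the helper names are hypothetical stand-ins, and the row offset table plays the role of the pre-cached scanline pointers.

```javascript
// Hypothetical stand-in for a bitmap: a flat 32-bit pixel buffer plus a table
// of per-row start offsets, computed once up front.
function createSurface(width, height) {
  const pixels = new Uint32Array(width * height);
  const rowStart = new Int32Array(height);
  for (let y = 0; y < height; y++) rowStart[y] = y * width;
  return { width, height, pixels, rowStart };
}

// Direct write: one table lookup and one store per pixel.
// No clipping and no bounds checks, so the caller must pass valid coordinates.
function drawPixel(surface, x, y, color) {
  surface.pixels[surface.rowStart[y] + x] = color;
}

const surface = createSurface(480, 480);
drawPixel(surface, 10, 20, 0xff00ff00);
console.log(surface.pixels[20 * 480 + 10].toString(16));
```

In a browser, a similar 32-bit view could be layered over a canvas `ImageData` buffer, whereas going through a per-pixel drawing call would be the rough equivalent of Canvas.Pixels.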

When tinkering with the code of your Delphi program, I noticed a strange phenomenon: the standard non-assembly code ComputeMandelDelphi routine took almost twice as long when pressing the reset button, compared to the first run.
The effect was gone when I changed the floats from Double to Single.
Can anyone guess what causes this?

Turns out the stack pointer was aligned differently between the initial run and the reset run: first on an 8-byte multiple, but next on a 4-byte multiple. This causes a bad performance hit when reading and writing doubles. I could ‘fix’ this by inserting a routine that checks the current stack pointer and moves it 4 bytes if needed (google for ‘StackAlloc’).
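
The arithmetic behind such a check is simple. The sketch below only illustrates it with made-up addresses and hypothetical helper names; an actual fix would have to live in Delphi or assembler, since JavaScript has no access to the stack pointer.

```javascript
// Illustration only: the addresses are hypothetical and the helper names are made up.
function isAligned(addr, alignment) {
  return addr % alignment === 0;
}

// Stacks grow downward on x86, so "moving" the stack pointer to an aligned
// address means rounding it down to the previous multiple of `alignment`.
function alignDown(addr, alignment) {
  return addr - (addr % alignment);
}

const firstRun = 0x0012f5a0;  // a multiple of 8: doubles are naturally aligned
const resetRun = 0x0012f5a4;  // only a multiple of 4: doubles straddle 8-byte boundaries

console.log(isAligned(firstRun, 8));             // true
console.log(isAligned(resetRun, 8));             // false
console.log(resetRun - alignDown(resetRun, 8));  // 4
```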

To sum up: a Delphi developer currently needs a lot of know-how to work around compiler and VCL inefficiencies, while these issues are solved automatically for web developers by the latest browsers. I hope the new Delphi compiler will eliminate this (although Canvas.Pixels is another matter).
Jon Lennart Aasenden says:

March 29, 2011 at 00:36

@Eric
To use Delphi-like properties of the Lazarus graphics system, you need to add the unit LCLIntf.
Dorin Duminica says:

March 29, 2011 at 01:38

Results on my i7:
– Chrome: 42ms
– Firefox: 49ms
– Opera: 98 ms — my favorite
– Internet Exploiter 9: 439ms — this is insane

Get ready for mobile and web apps. That’s the future.
Dorin Duminica says:

March 29, 2011 at 01:47

Sorry for double posting, but I thought I’d let you know my results with the pre-compiled executable in the post:
– normal: 140.6ms
– SSE enhanced: 79.4ms
Eric says:

March 29, 2011 at 08:59

@Jon Lennart Aasenden
LCLIntf is already in the uses, it doesn’t seem to contain any reference to ‘TBitmap’ anyway.

@Dorin Duminica
Your i7 results are a bit surprising. I’ve got an i5, and while my “normal” and “SSE enhanced” figures are similar to yours, the browser scores (apart from IE9) just don’t match. Chrome, for instance, is much slower than Firefox 4 on my i5.
Jon Lennart Aasenden says:

March 29, 2011 at 15:30

@Eric
It’s been a while since I played around with graphics in Lazarus. They have 2 types of graphics objects. The first one attempts to be platform independent (which is very slow). So TBitmap and TCanvas under Lazarus won’t be a fair comparison (besides, Firefox will use DIBs exclusively). I seem to remember that TLazIntfImage is the class you want.
Eric says:

March 29, 2011 at 16:30

@Jon Lennart Aasenden
TLazIntfImage seems to wrap a raw image of some kind. However, I couldn’t see an obvious way to specify a pixel depth or to pass it to the LCL image component in an efficient fashion. I guess there are too many layers of abstraction, and I have too little time to unravel them 🙂
Dorin Duminica says:

March 29, 2011 at 18:05

@Eric
Maybe the fact that I was running a VM at the same time influenced the results, but I don’t think it matters too much; after all, the idea is that JavaScript has become a very fast scripting engine.
BTW, I was running the “normal size” and just tested on my i3 laptop in Opera; the result is 130 ms…
Jon Lennart Aasenden says:

March 30, 2011 at 16:18

Having thought about this for a couple of days (it was shocking at first), the comparison might not be entirely fair.

First of all, pixel plotting using TCanvas/TBitmap is expensive. But compared to all the maths involved here, it’s hardly TCanvas that’s the bottleneck. But you should be able to shave off quite a few milliseconds by going for Graphics32 or any pure DIB class.

The other, more important factor is no doubt threading. I have no idea how Firefox handles scripts, but I seem to remember that JavaScript is decoded into an intermediate format (bytecodes) and executed in its own thread. Delphi on the other hand is a message-based system which tries to use as little CPU as possible. We might get more realistic results if we put the code in its own thread with priority set to normal?

The thread should have exclusive access to its graphics DIB, and the main form should poll this out in the OnTerminate event and display it.

I realize of course that this will never make up for SSE and all the other low-level features, but at least it might give us a programmatic environment that is more fair for comparison?
Eric says:

March 30, 2011 at 17:15

@Jon Lennart Aasenden

“But you should be able to shave off quite a few milliseconds by going for Graphics32 or any pure DIB class.”

If you look at the Delphi code, you’ll see it’s already bypassing TCanvas, and performs direct access, and actually if you comment out the pixel writing, timings don’t change.

The approach in the code I posted is faster than a DIB class: it’s direct memory access with pre-cached scanline pointers, no clipping, no checks, and as Victor remarked, it’s probably already unfair to JavaScript, as no doubt most Delphi users would use less efficient pixel access…

“We might get more realistic results if we put the code in its own thread with priority set to normal?”

There are no messages involved during the computation, and the message queue isn’t executing. Note that FireFox has to deal with a message queue, and the JavaScript version actually makes use of events too.
Jon Lennart Aasenden says:

April 1, 2011 at 23:36

@Eric
Just a thought 🙂
Jon Lennart Aasenden says:

April 1, 2011 at 23:39

@Eric
And I did download the code btw. What I looked at was the fact that you perform an assignment of the picture to a TImage (or something to that effect, I have a fever at the moment so I might be mistaken). This should not be a part of the computation since it’s a very slow process in Delphi. But you are right, overall Delphi is slower here, no doubt about it.
Eric says:

April 2, 2011 at 00:42

“This should not be a part of the computation since it’s a very slow process in Delphi.”

It isn’t part of the timing (neither in Delphi nor in JavaScript), but yes, I also don’t think including it in the timings (on both sides) would favor Delphi.

Comments are closed.
