Kudos to the Firefox 4 TraceMonkey team!

I’ve been quite impressed with the JavaScript floating point performance in Firefox 4, which puts the Delphi compiler to shame. See for yourself with this fractal rendering demo:

Mandelbrot Set in HTML 5 Canvas

I’ve made a version of the same code in Delphi XE (source + pre-compiled executable, 331 kB ZIP). On my machine here, at the 480×480 resolution, Firefox 4 gets the default view rendered in 124 ms, while the “regular” Delphi version, which is limited to the old FPU, takes about 200 ms.

It takes manually SSE-enhanced Delphi code to get back on top, with an 87 ms render time. It’s quick, non-optimized scalar SSE code, sure, and could likely be improved, but the point remains that without asm, Delphi XE’s native compiler trails TraceMonkey in the floating point department…
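
For a sense of what is being timed, here is a minimal sketch of an escape-time Mandelbrot kernel in JavaScript. It is not the code from the linked demo, nor a port of the Delphi ZIP: the function names, the view bounds and the 256-iteration cap are assumptions chosen for illustration, so timings from it are not comparable to the figures quoted here.

```javascript
// Hypothetical escape-time kernel; NOT the code from the linked demo.
// Returns how many iterations z = z^2 + c stays within |z| <= 2.
function mandelIterations(cx, cy, maxIter) {
  let x = 0, y = 0, i = 0;
  while (i < maxIter) {
    const x2 = x * x, y2 = y * y;
    if (x2 + y2 > 4) break;   // escaped: |z|^2 > 4
    y = 2 * x * y + cy;       // imaginary part of z^2 + c
    x = x2 - y2 + cx;         // real part of z^2 + c
    i++;
  }
  return i;
}

// Renders a size x size grid of iteration counts (no canvas involved).
function renderMandel(size, maxIter) {
  const out = new Uint16Array(size * size);
  const x0 = -2.0, y0 = -1.5, span = 3.0; // assumed view, not the demo's
  const step = span / size;
  for (let py = 0; py < size; py++) {
    const cy = y0 + py * step;
    for (let px = 0; px < size; px++) {
      out[py * size + px] = mandelIterations(x0 + px * step, cy, maxIter);
    }
  }
  return out;
}

const t0 = Date.now();
const counts = renderMandel(480, 256);
console.log(`480x480 grid computed in ${Date.now() - t0} ms`);
```

The inner loop is nothing but double-precision multiplies, adds and a compare, which is why the quality of the floating-point codegen dominates the result.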

So Embarcadero, how is that Delphi 64 version coming? Is it properly SSE-enabled?

26 thoughts on “Kudos to the Firefox 4 TraceMonkey team!”

  1. @A.Bouchez
    x87 is still supported, see there, but it seems that Extended is no longer supported in D64, which would hint at SSE all the way.

    @Nikola J
    The JavaScript sample is not mine.

    @Linas
    Interesting, what CPU do you have? What are your figures for Delphi FPU?

  2. Change from double to single and change the references to fbitmap.width to an object member. My guess is that the JavaScript compiler does exactly that… but I have to admit it’s very impressive!

    Also, your SSE code could be improved a bit 😉 you only use the single SSE opcode variants – I’m sure one could refine that a bit 😉

  3. @Mike
    No, the JavaScript version is double-precision, that’s confirmed by the zooming and quality it achieves. The Bitmap.Width is nowhere near the critical path, and it’s implemented rather trivially to begin with (it’s not a call to the WinAPI).

    And yes, the SSE can surely be improved, but that was only to underline the complexity you have to go for in Delphi to beat what TraceMonkey achieves directly on a dynamic language, while Delphi is held back by a prehistoric floating-point codegen.

  4. @Eric:

    Actually I’m really impressed by the JS performance!!! (and also that they really use double-precision data values)

    In my case even the SSE implementation is much slower than the JS engine
    (a Core 2 Duo 2.33GHz).

    But it helped in my case that I at least removed the calls to fbitmap.width (which is a virtual one…) and declared the drawpixel method inline…

    And you are definitely right, the FPU codegen IS PREHISTORIC and needs to be revamped!

    Also I guess the Delphi compiler guys should take a peek into the JS compiler internals to see how they achieve that 😉

  5. On my old AMD Athlon Thoroughbred 1700+ (no SSE) Delphi beats Firefox
    Delphi: 665ms
    Firefox 4: 2523ms
    .. so do not make general assumptions yet.

  6. Does anyone have time to check how Free Pascal (FPC) compares?
    If I understand correctly FPC can generate SSE instructions using “fpc -O3 -CfSSE2”

    Best regards,
    Ajasja

  7. @Ajasja Ljubetič
    I just tested it via Lazarus (FPC 2.5.1) and its auto-convert tool. After excluding DrawPixel (as LCL’s TBitmap doesn’t provide ScanLines access), the performance is similar to TraceMonkey on my machine, so that’s better than Delphi indeed.

    But to keep things in perspective, a slightly more refined SSE version (still just scalar, not SIMD) than the one in the zip I posted now runs twice as fast as TraceMonkey on my machine, so there is still compiler optimization potential.

  8. Hmm, then I seriously have to consider FPC as an alternative (I’m writing
    some number crunching stuff in Delphi).
    Thank you!

  9. You didn’t even mention the other big drawback of the Delphi version over the Javascript/HTML 5 code: you had to use a ‘special trick’ in DrawPixel, using cached scanlines, to make the pixel drawing fast enough. Many Delphi developers will be unaware of the performance benefits over using Canvas.Pixels.

    When tinkering with the code of your Delphi program, I noticed a strange phenomenon: the standard non-assembly code ComputeMandelDelphi routine took almost twice as long when pressing the reset button, compared to the first run.
    The effect was gone when I changed the floats from Double to Single.
    Anyone guess what causes this?

    Turns out the stack pointer was aligned differently between the initial run and the reset run: first on an 8-byte multiple, but next on a 4-byte multiple. This causes a bad performance hit when reading and writing doubles. I could ‘fix’ this by inserting a routine that checks the current stack pointer and moves it 4 bytes if needed (google for ‘StackAlloc’).

    To sum up: a Delphi developer currently needs a lot of know-how to work around compiler and VCL inefficiencies, while these issues are solved automatically for web developers by the latest browsers. I hope the new Delphi compiler will eliminate this (although Canvas.Pixels is another matter).

  10. Results on my i7:
    – Chrome: 42ms
    – Firefox: 49ms
    – Opera: 98 ms — my favorite
    – Internet Exploiter 9: 439ms — this is insane

    Get ready for mobile and web apps, that’s the future.

  11. Sorry for double posting, but I thought I’d let you know my results with the pre-compiled executable in the post:
    – normal: 140.6ms
    – SSE enhanced: 79.4ms

  12. @Jon Lennart Aasenden
    LCLIntf is already in the uses, it doesn’t seem to contain any reference to ‘TBitmap’ anyway.

    @Dorin Duminica
    Your i7 results are a bit surprising. I’ve got an i5, and while my “normal” and “SSE enhanced” results are similar to yours, the browser scores (apart from IE9) just don’t match. Chrome f.i. is much slower than Firefox 4 on my i5.

  13. @Eric
    It’s been a while since I played around with graphics in Lazarus. They have 2 types of graphics objects. The first one attempts to be platform independent (which is very slow). So TBitmap and TCanvas under Lazarus won’t be a fair comparison (besides, Firefox will use DIBs exclusively). I seem to remember that TLazIntfImage is the class you want.

  14. @Jon Lennart Aasenden
    TLazIntfImage seems to wrap a raw image of some kind, however I couldn’t see an obvious way to specify a pixel depth or to pass it to the LCL image component in an efficient fashion, I guess there are too many layers of abstraction, and I have too little time to unravel them 🙂

  15. @Eric
    Maybe the fact that I was running a VM at the same time influenced the results, but I don’t think it matters too much; after all, the idea is that JavaScript has become a very fast scripting engine.
    BTW. I was running the “normal size” and just tested on my i3 laptop in Opera, the result is 130ms…

  16. Having thought about this for a couple of days (it was shocking at first), the comparison might not be entirely fair.

    First of all, pixel plotting using TCanvas/TBitmap is expensive. But compared to all the maths involved here, it’s hardly TCanvas that’s the bottleneck. But you should be able to shave off quite a few milliseconds by going for Graphics32 or any pure DIB class.

    The other, more important factor is no doubt threading. I have no idea how Firefox handles scripts, but I seem to remember that JavaScript is decoded into an intermediate format (bytecodes) and executed in its own thread. Delphi on the other hand is a message based system which tries to use as little CPU as possible. We might get more realistic results if we put the code in its own thread with priority set to normal?

    The thread should have exclusive access to its graphics DIB, and the mainform should poll this out in the onTerminate event and display it.

    I realize of course that this will never make up for SSE and all the other low-level features, but at least it might give us a programmatic environment that is more fair for comparison?

  17. @Jon Lennart Aasenden

    But you should be able to shave off quite a few milliseconds by going for Graphics32 or any pure DIB class.

    If you look at the Delphi code, you’ll see it’s already bypassing TCanvas, and performs direct access, and actually if you comment out the pixel writing, timings don’t change.

    The approach in the code I posted is faster than a DIB class: it’s direct memory access with pre-cached scanline pointers, no clipping, no checks, and as Victor remarked, it’s probably already unfair to JavaScript, as no doubt most Delphi users would use less efficient pixel access…

    We might get more realistic results if we put the code in its own thread with priority set to normal?

    There are no messages involved during the computation, and the message queue isn’t executing. Note that FireFox has to deal with a message queue, and the JavaScript version actually makes use of events too.

  18. @Eric
    And I did download the code btw. What I looked at was the fact that you perform an assignment of the picture to a TImage (or something to that effect, I have a fever at the moment so I might be mistaken). This should not be a part of the computation since it’s a very slow process in Delphi. But you are right, overall Delphi is slower here, no doubts about it.

  19. This should not be a part of the computation since it’s a very slow process in Delphi.

    It isn’t part of the timing (neither in Delphi nor in JavaScript), but yes, I also don’t think including it in the timings (on both sides) would favor Delphi.
