Search code examples
performancefortranfortran77julia

Julia (Julia-lang) Performance Compared to Fortran and Python


I adapted a simple program to compute and plot the movement vortices of to Julia to test the language, I also wrote it in Python for no particular reason.

(Disclaimer: 1. Every performance comparison on stackoverflow I read gets slammed for not being comprehensive/correct/well written/relevant etc. etc. - I'm not pretending this is a real comparison, I would just like to know how to make the Julia quicker. 2. I know the python could be optimized, implemented in Cython etc, that's not part of this discussion, it's just here for a reference of equivalent functions in Julia and Python.)

The code and performance results can been seen in a gist.

The performance of Julia is significantly slower than Fortran. The times taken to perform the calculation itself are (50000 time steps):

Fortran: 0.051s
Julia: 2.256s
Python: 30.846s

Julia is much slower (~44 times slow) than Fortran, the gap narrows but is still significant with 10x more time steps( 0.50s vs 15.24s ).

These results are significantly different to those shown on the julia home page. What am I doing wrong? Could I fix the Julia to be significantly quicker?

I've skim read the Julia Performance Tips page and the code behind the comparison on the Julia home page and nothing is standing out to me to fix.

Also interestingly Julia is extremely slow to load PyPlot ( 5secs ish!!) and much slower than Python to read the text file. Could I do anything to improve these things?

Note that the times above don't show loading time for Julia and Python, it's just the raw time taken for the computation AFAIK - see the code. For fortran it's the whole thing. The plotting has been turned off, roughly, in each case to allow speed comparison.

Computer: Intel i7-3770, 16GB ram, SSD HD, OS: Ubuntu 13.10 64bit., Fortran: gfortran, GNU Fortran (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1, Julia: Version 0.3.0-prerelease+396 (2013-12-12 00:18 UTC), Commit c5364db* (0 days old master), x86_64-linux-gnu, Python: 2.7.5+


Update:

Based on ivarne's advice I rewrote the Julia script (updated in gist above): encapsulating grunt work in functions, declaring the type of everything and splitting different elements of matrices into different arrays where applicable. (I included Float64 in quite a few places as I tried Float32 to see if that helped, it didn't most of the time).

The results are as follows:

50,000 time steps:

Fortran: 0.051s (entire programme)
Julia: raw calc.: 0.201s, calc. and return (?): 0.758s, total exec.: 6.947s

500,000 time steps:

Fortran: 0.495s (entire programme)
Julia: raw calc.: 1.547s, calc. and return (?): 2.094s, total exec.: 8.521s

In conclusion:

  • You can speed up Julia quite a bit.

  • You can significantly affect the apparently speed of Julia depending on how you measure it's performance.


Solution

  • I have followed the Julia project for a while now, and I have some comments to the code that might be relevant.

    • It seems like you run a substantial amount of the code in global scope. The global environment is currently very slow in Julia, because the types all variables have to be checked on every iteration. Loops should usually be written in a function.
    • You seem to use array slicing. Currently that makes a copy because Julia does not have fast Array views. You might try to switch them for subarray, but they are currently much slower than they should.

    The loading time of PyPlot (and any other package) is a known issue, and is because parsing and compiling Julia code to machine code, is time consuming. There are ideas about having a cache for this process so that this process becomes instantaneous, but it is not finished yet. The Base library is currently cached in compiled state, so most of the infrastructure is on the master branch now.

    ADDED: I tried to run the test in an isolated function and got these results. See this gist

    Parsing:

    elapsed time: 0.334042578 seconds (11797548 bytes allocated)
    

    And tree consecutive runes of the main test loop.

    elapsed time: 0.62999287 seconds (195210884 bytes allocated)
    elapsed time: 0.39398753 seconds (184735016 bytes allocated)
    elapsed time: 0.392036875 seconds (184735016 bytes allocated)
    

    Notice how the timing improved after the first run, because the compiled code was used again.

    Update 2 With some improved memory handling (ensure reuse of arrays, because assignment does not copy), I got the timing down to 0.2 seconds (on my machine). There is definitely more that could be done to avoid allocating new arrays, but then it starts to be a little tricky.

    This line does not do what you think:

    vx_old = vx
    

    but this do what you want:

    copy!(vx_old, vx)
    

    and devectorize one loop.

    x += 0.5*(vx + vx_old)*delta_t
    y += 0.5*(vy + vy_old)*delta_t
    

    to:

    for i = 1:nvortex
        x[i] += 0.5*(vx[i] + vx_old[i])*delta_t
        y[i] += 0.5*(vy[i] + vy_old[i])*delta_t
    end