Tags: python, performance, quantitative-finance, conceptual, parallelism-amdahl

Non-trivial recoding: How to speed up my program? Cython, numba, multiprocessing and numpy?


I have (or am actually still working on) a program (a pairs-trading strategy) which does the following:

  1. Retrieve a subset of a larger data set (financial data: a datetime index and stock prices for ~100 stocks) sitting in a postgres database.
  2. Clean the data (drop stocks with >30% NaN) and calculate returns and an index (relative to the first observation of each stock)
  3. Find all combinations of stock pairs and calculate the correlation (actually some measure similar to it, but that is not important here)
  4. Rank the pairs from highest correlation to lowest, or pick only pairs with a correlation above a defined threshold, e.g. 0.9
  5. Check each of these pairs for cointegration, both ways! and rank them according to their test value
  6. Pick the top n, e.g. 10 pairs, to be traded and calculate some signal based on moving averages and standard deviations
  7. Retrieve an "out-of-sample" window and trade the stocks
  8. Record each day's return (e.g. over 5 days) in a logbook
  9. Calculate some statistics

After these 9 steps, start over again, retrieve another training window and perform the analysis...

My approach would be - and please correct me if you see something better:
1. Extract as many functions as possible from the program
2. Loop steps 1-9 through multiple training and trading windows, schematically as in the sketch below
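
A bare-bones sketch of the loop I have in mind (all names here are placeholder stubs, not my actual code):

    # placeholder stubs, just to show the intended control-flow
    def load_window( w ):         return w              # 1. fetch from postgres
    def clean_and_index( d ):     return d              # 2. drop NaN, calc returns
    def rank_pairs( d ):          return [ ( d, d ) ]   # 3.-4. correlation ranking
    def test_cointegration( p ):  return p              # 5. both ways
    def trade_signals( p, d ):    return [ 0.0 ]        # 6.-7. signals, out-of-sample trading
    def summarize( log ):         return sum( log )     # 9. statistics

    def run_backtest( windows ):
        log = []
        for train, trade in windows:                    # start over per window
            returns = clean_and_index( load_window( train ) )
            top10   = test_cointegration( rank_pairs( returns ) )[:10]
            log    += trade_signals( top10, load_window( trade ) )   # 8. logbook
        return summarize( log )

    print( run_backtest( [ ( 0, 1 ), ( 2, 3 ) ] ) )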

and my resulting questions (inspired by many threads here in the forum, e.g. How to make your python code run faster) are:

  • How can I identify which part of my code can be run in parallel?
  • Somehow, it does not seem trivial to me at all: what technique should I apply to "re-write" the code, such that it can use multi-processing?
  • Also not always obvious: when rewriting loops as functions, is there any specific angle to always look at?
  • Does it make sense to "numba.jit()" all functions?
  • Should I change all formats of my data to float64? What disadvantages could that bring? (at the moment they are "standard" numeric)
  • Is there any checklist from which I can see when a loop can be vectorized?

Please excuse the many - rather conceptual - questions, but I think that if I could understand all of the above "pain" points, it would really boost my "logical" understanding, and it would also be very beneficial for newcomers to python.


Solution

  • Non-trivial questions can yield only simplified answers:

    Performance improvement needs a lot of attention, ~ [man*decades] of it ...

    Simply put, do not expect to read a few examples and become an expert in this.

    No. 1: A poor algorithm will never improve just by some (semi-)automated transformation. Smart re-factoring may gain +100% in performance inside native, plain python code ( example below ), yet well-crafted code, matching the close-to-silicon properties of the code-execution device, will show such efforts in a different light, as that said ~ +100% performance boom ends up delivering almost the same performance once transformed into a jit-compiled unit. This means that premature optimisations may turn out useless when moving on to carefully engineered high-performance code. At least you have been warned.

    python is a great tool, I love it for its almost-infinitely precise maths. Yet, carving out both ultimate precision and ultimate performance at the same time seems closer to Heisenberg's uncertainty principle than Computer Scientists - and even more so the fans - are keen to admit. I have just spent a long time doing both, to be able to put it compactly into a few short paragraphs.


    Q4: Does it make sense to "numba.jit()" all functions?

    numba is a great tool for a stable code-base, so let's start with it:

    The low-hanging fruit of automated transformations is easily cherry-picked with the numba.jit() tools. Benchmarking will then help you shave off almost all the overhead your code need not carry.

    If your code depends on code-elements that the still-evolving numba.jit() code-transformers cannot transcode, you are stuck. Having worked with numba since its very first releases, { list | dict | class | ... } are killers for any further dreams of having code (auto-)converted a bit nearer to the silicon. Also, all functions referred to from such code have to be numba.jit()-able themselves, so almost forget about having some high-level import-ed code-base easily translated by numba, if its original code was not systematically designed with numba in mind.
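
    As a minimal sketch of the happy path ( a hypothetical rolling-sum kernel, not code from the O/P ): a tight numeric loop over a numpy array compiles cleanly in nopython-mode, whereas feeding a plain python dict or class instance through the very same decorator would fail or fall back to slow object-mode:

    import numpy as np
    import numba

    @numba.jit( nopython = True )                    # nopython: refuse any CPython fall-back
    def rolling_sum( aVECTOR, N_window ):
        # tight numeric loop + numpy array ops: the numba-friendly subset
        out    = np.empty( aVECTOR.shape[0] - N_window + 1 )
        out[0] = aVECTOR[:N_window].sum()
        for i in range( 1, out.shape[0] ):
            out[i] = out[i-1] + aVECTOR[i+N_window-1] - aVECTOR[i-1]
        return out

    rolling_sum( np.arange( 10, dtype = np.float64 ), 3 )   # 1st call pays the jit-compilation cost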


    Q5: Should I change all formats of my data to float64?
    What disadvantages could that bring? ( at the moment they are "standard" numeric )

    float32 has, besides the halved static sizing of its memory-footprint in the [SPACE]-domain, a few principal disadvantages.

    Some modules ( typically those that inherit from FORTRAN numerical solvers and similar heritage ) auto-convert any externally passed data into their local float64 replicas ( so both the [SPACE] and the [TIME] penalties grow, outside of your domain of control ).

    Better expect an increased code-execution penalty in the [TIME]-domain, as non-aligned cell-boundaries are expensive ( this goes deep into the assembly-level of the code and into the CPU-instruction sets and cache-hierarchies, should one wish to master all the details at this level ).

    You may see almost ~ 3x slower execution on float32 in the benchmarks below.
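
    A quick, self-contained way to see the halved [SPACE]-domain footprint and to detect a silent float64 up-conversion behind a library call ( np.sort below is only a stand-in for whatever routine is under test ):

    import numpy as np

    aPV_32 = np.arange( 1000000, dtype = np.float32 )
    aPV_64 = aPV_32.astype( np.float64 )             # explicit, one-off conversion

    print( aPV_32.nbytes, aPV_64.nbytes )            # 4000000 8000000 -> halved static footprint

    # if a routine hands back float64 for a float32 input, it has paid a hidden
    # [SPACE] + [TIME] replica cost inside, outside of your domain of control:
    print( np.sort( aPV_32 ).dtype )                 # float32 -> no hidden replica here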


    Q6: Is there any checklist from which I can see when a loop can be vectorized?

    Auto-vectorising transformers are nothing less than a Nobel-prize target.

    Some tweaking can be done by smart and handy designers. Do not expect any low-hanging fruit in this domain for any operations more complex than a trivial broadcasting or some easy-to-design numpy striding tricks.

    Professional code-transformation packages are expensive ( someone has to pay for the expertise collected over many [man*years] ) and typically achieve their ROI only if deployed at scale.
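
    One piece of low-hanging fruit does exist right in the O/P use-case: the pairwise correlation loop of step 3 collapses into a single call. A sketch, assuming a returns matrix of shape [n_obs, n_stocks]:

    import numpy as np
    from itertools import combinations

    rng     = np.random.default_rng( 0 )
    returns = rng.standard_normal( ( 250, 100 ) )    # [n_obs, n_stocks]

    # loop version: ~ n*(n-1)/2 python-level iterations
    corr_loop = { ( i, j ): np.corrcoef( returns[:,i], returns[:,j] )[0,1]
                  for i, j in combinations( range( returns.shape[1] ), 2 )
                  }
    # vectorised version: one call, one [n_stocks, n_stocks] matrix
    corr_mat  = np.corrcoef( returns, rowvar = False )

    assert np.isclose( corr_loop[( 3, 42 )], corr_mat[3, 42] )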


    Q1: How can I identify which part of my code can be run in parallel?

    You are lucky not to have to design your code to run in a true-[PARALLEL], but in a "just"-[CONCURRENT] fashion. If someone says parallel, check whether the system indeed requires meeting all the conditions of a true-[PARALLEL] process-scheduling; in the majority of cases, a "just"-[CONCURRENT] scheduling is all the speaker actually asked for ( details go beyond the scope of this post ). Python's GIL-stepped locking prevents any but a sub-process-based split of the workflow, but at a cost, so harvest the independence of your processing, as this will reward your intentions without the penalty of the additional overhead add-on costs that the overhead-strict Amdahl's Law punishes.
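
    In the O/P pipeline, the training/trading windows are mutually data-independent, which is exactly such a harvestable case. A minimal sketch ( run_window being only a placeholder for the whole steps 1-9 body ):

    from multiprocessing import Pool

    def run_window( aWindowID ):
        # placeholder for steps 1-9 on one training/trading window;
        # windows share no state, so no inter-process communication is needed
        return ( aWindowID, 0.0 )                    # ( window, some statistic )

    if __name__ == '__main__':
        # a sub-process-based split: the only GIL-free route for CPU-bound python
        with Pool( processes = 4 ) as aPool:
            print( aPool.map( run_window, range( 24 ) ) )   # e.g. 24 rolling windows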


    Benchmark, benchmark, benchmark:

    The actual cost/effect ratio will have to be validated on the respective versions of python and numba and on the CPU / cache architecture at hand, so benchmarking is the only way to confirm any improvement ( at its costs ).
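
    The aClk micro-timer used in the readouts below is not a standard-library tool; a minimal stand-in, reporting elapsed [us], could look like this:

    import time

    class Clk:
        def start( self ):
            self._t0 = time.perf_counter()
        def stop( self ):                            # elapsed time in [us]
            return int( ( time.perf_counter() - self._t0 ) * 1E6 )

    aClk = Clk()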

    The following example shows the savings for a trivial Exponential Moving Average function, implemented in several more or less smart manners.

    import numba                                     # needed for the @numba.jit decorators below

    def plain_EMA_fromPrice(                   N_period, aPriceVECTOR ):
        ...
    @numba.jit(                                                         "float32[:]( int32, float32[:] )" )
    def numba_EMA_fromPrice_float32(           N_period, aPriceVECTOR ):
        ...
    @numba.jit(                                                         "float32[:]( int32, float32[:] )", nopython = True, nogil = True )
    def numba_EMA_fromPrice_float32_nopython(  N_period, aPriceVECTOR ):
        ...
    @numba.jit(                                                         "float64[:]( int64, float64[:] )" )
    def numba_EMA_fromPrice_float64(           N_period, aPriceVECTOR ):
        ...
    @numba.jit(                                                         "float64[:]( int64, float64[:] )", nopython = True, nogil = True )
    def numba_EMA_fromPrice_float64_nopython(  N_period, aPriceVECTOR ):
        ...    
    
    
    def plain_EMA_fromPrice2(                  N_period, aPriceVECTOR ):
        ...
    @numba.jit(                                                         "float32[:]( int32, float32[:] )" )
    def numba_EMA_fromPrice2_float32(          N_period, aPriceVECTOR ):
        ...
    @numba.jit(                                                         "float32[:]( int32, float32[:] )", nopython = True, nogil = True )
    def numba_EMA_fromPrice2_float32_nopython( N_period, aPriceVECTOR ):
        ...
    @numba.jit(                                                         "float64[:]( int64, float64[:] )" )
    def numba_EMA_fromPrice2_float64(          N_period, aPriceVECTOR ):
        ...
    @numba.jit(                                                         "float64[:]( int64, float64[:] )", nopython = True, nogil = True )
    def numba_EMA_fromPrice2_float64_nopython( N_period, aPriceVECTOR ):
        ...
    
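
    The function bodies are elided above; one possible body for the plain variant ( assuming a standard recursive EMA with alpha = 2 / ( N_period + 1 ) - an assumption, not the O/P's actual code ):

    import numpy as np

    def plain_EMA_fromPrice( N_period, aPriceVECTOR ):
        alpha  = 2. / ( N_period + 1 )               # assumed smoothing factor
        ema    = np.empty_like( aPriceVECTOR )
        ema[0] = aPriceVECTOR[0]
        for i in range( 1, aPriceVECTOR.shape[0] ):
            ema[i] = alpha * aPriceVECTOR[i] + ( 1. - alpha ) * ema[i-1]
        return ema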

    Improving performance
    from ~ 710 [us] -> ~ 160 [us] by code-re-factoring and memory-alignment, next
    down to ~ 12-17 [us] by numba.jit():

    >>> import numpy as np
    >>> aPV_32 = np.arange( 100, dtype = np.float32 )
    >>> aPV_64 = np.arange( 100, dtype = np.float64 )
    
    >>> aClk.start();_ = plain_EMA_fromPrice(                  18, aPV_32 );aClk.stop()
    715L
    723L
    712L
    722L
    975L
    
    >>> aClk.start();_ = plain_EMA_fromPrice(                  18, aPV_64 );aClk.stop()
    220L
    219L
    216L
    193L
    212L
    217L
    218L
    215L
    217L
    217L
    
    >>> aClk.start();_ = numba_EMA_fromPrice_float32(          18, aPV_32 );aClk.stop()
    199L
    15L
    16L
    16L
    17L
    13L
    16L
    12L
    
    >>> aClk.start();_ = numba_EMA_fromPrice_float64(          18, aPV_64 );aClk.stop()
    170L
    16L
    16L
    16L
    18L
    14L
    16L
    14L
    17L
    
    >>> aClk.start();_ = numba_EMA_fromPrice_float64_nopython( 18, aPV_64 );aClk.stop()
    16L
    17L
    17L
    16L
    12L
    16L
    14L
    16L
    15L
    
    >>> aClk.start();_ = plain_EMA_fromPrice2(                 18, aPV_32 );aClk.stop()
    648L
    654L
    662L
    648L
    647L
    
    >>> aClk.start();_ = plain_EMA_fromPrice2(                 18, aPV_64 );aClk.stop()
    165L
    166L
    162L
    162L
    162L
    163L
    162L
    162L
    
    >>> aClk.start();_ = numba_EMA_fromPrice2_float32(         18, aPV_32 );aClk.stop()
    43L
    45L
    43L
    41L
    41L
    42L
    
    >>> aClk.start();_ = numba_EMA_fromPrice2_float64(         18, aPV_64 );aClk.stop()
    17L
    16L
    15L
    17L
    17L
    17L
    12L
    
    >>> aClk.start();_ = numba_EMA_fromPrice2_float64_nopython( 18, aPV_64 );aClk.stop()
    16L
    15L
    15L
    14L
    17L
    15L