I have (or am actually working on) a program (a pairs-trading strategy) which does the following:
After these 9 steps, it starts over again, retrieves another training window and performs the analysis...
My approach would be - and please correct me if you see something better:
1. Extract as many functions as possible from the program
2. Loop steps 1-9 through multiple training and trading windows ( a minimal sketch of such a loop follows below )
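To make point 2 concrete, here is a minimal sketch of the walk-forward loop I have in mind; the names run_backtest, fAnalyse, train_len and trade_len are just placeholders, and fAnalyse() is where steps 1-9 would live:

import numpy as np

def run_backtest( aPriceMATRIX, fAnalyse, train_len = 252, trade_len = 21 ):
    """Walk forward over the history: a training window, then the trading window right after it."""
    results = []
    start   = 0
    while start + train_len + trade_len <= aPriceMATRIX.shape[0]:
        aTrainWINDOW = aPriceMATRIX[start:start + train_len]
        aTradeWINDOW = aPriceMATRIX[start + train_len:start + train_len + trade_len]
        results.append( fAnalyse( aTrainWINDOW, aTradeWINDOW ) )  # steps 1-9 live inside fAnalyse()
        start += trade_len                                        # slide forward and start over
    return results

# usage sketch with a dummy per-window function
aPriceMATRIX = np.random.rand( 1000, 2 )                          # two price series, 1000 bars
print( len( run_backtest( aPriceMATRIX, lambda tr, td: tr.mean() ) ) )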
and my resulting questions (inspired by many threads here in the forum, i.e. How to make your python code run faster) are:
- Should I numba.jit() "all" functions?
- Should I use float32 instead of float64? What disadvantages could happen? (at the moment they are "standard" numeric)
Please excuse the many - rather conceptual - questions, but I think, if I could understand all of the above "pain" points, it would really boost my "logical" understanding and it would also be very beneficial for new python joiners.
Non-trivial questions can yield but simplified answers: this craft is collected over [man*decades] of experience. Simply put, do not expect to read a few examples and become an expert in this.
No. 1: A poor algorithm will never improve just by some (semi-)automated transformation. Smart re-factoring may jump +100% in performance inside native plain python code ( example below ), yet well crafted code, matching the close-to-silicon properties of the code-execution device, will show such efforts in another light, as the said ~ +100% code-improvement performance boom comes out with almost the same performance once transformed into a jit-compiled unit. This means that pre-mature optimisations may turn out useless on going into carefully engineered high-performance code. At least you have been warned.
python is a great tool, I love it for almost-infinitely precise maths. Yet, carving out both ultimate precision and ultimate performance seems closer to Heisenberg's principle than Computer Scientists, and the more so their fans, are keen to admit. I have just spent a long time doing both, so as to be able to put it compactly into a few-words paragraph.
Should I numba.jit() "all" functions?
numba is a great tool for a stable code-base, so let's start with it:
The low-hanging fruit of automated transformations is easily cherry-picked with the numba.jit() tools. Benchmarking will help you shave off almost all the overheads your code need not carry.
If you depend on code-elements that the still evolving numba.jit() code-transformers cannot transcode, you are done. Having worked with numba since its very initial releases, { list | dict | class | ... } are killers for any further dreams of having code (auto-)converted a bit nearer to the silicon. Also, all the functions referred to from such code have to be numba.jit()-able themselves, so almost forget about having some high-level import-ed code-bases easily translated by numba, if their original code was not systematically designed with numba in mind.
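A minimal sketch of that boundary, assuming a recent numba version ( the function names are mine ): a purely numeric function compiles cleanly in nopython mode, while a plain python dict argument cannot be typed and makes the compilation fail at the first call.

import numpy as np
import numba

@numba.jit( nopython = True )
def numeric_only( aVECTOR ):                 # plain numerics compile to native code
    total = 0.
    for x in aVECTOR:
        total += x * x
    return total

@numba.jit( nopython = True )
def uses_a_dict( aDICT ):                    # a plain python dict cannot be typed in nopython mode
    return aDICT[ "key" ]

aV = np.arange( 10, dtype = np.float64 )
print( numeric_only( aV ) )                  # fine ( the first call pays the compilation cost )

try:
    uses_a_dict( { "key": 1. } )             # raises a TypingError: dict is not a supported argument type
except Exception as anERR:
    print( type( anERR ).__name__ )

Newer numba releases do offer numba.typed.Dict and numba.typed.List, but those have to be designed in from the start, which is exactly the point above.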
Should I use float32 instead of float64?
float32 has, besides a halved static sizing of the memory-footprint in the [SPACE]-domain, a few principal disadvantages.
Some modules ( typically those that inherit from FORTRAN numerical solvers and similar heritage ) auto-convert any externally passed data into their local float64 replicas ( so both the [SPACE]- and [TIME]-domain penalties grow, outside of your domain of control ).
Better expect an increased code-execution penalty in the [TIME]-domain, as non-aligned cell boundaries are expensive ( this goes deep into the assembly level of the code, the CPU instruction sets and the cache hierarchies, if one wants to master all details on this level ). You may see almost ~ 3x slower execution on float32 in the benchmarks below ( a tiny numpy illustration follows right after this point ).
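A tiny numpy-only illustration of both points above ( the array names are mine ): the halved footprint is real, but one mixed-precision operation downstream silently re-promotes everything to float64, which is how the [SPACE] saving evaporates outside of your domain of control.

import numpy as np

aPV_32 = np.arange( 1000000, dtype = np.float32 )
aPV_64 = np.arange( 1000000, dtype = np.float64 )

print( aPV_32.nbytes, aPV_64.nbytes )       # 4000000 vs 8000000 [B]: the halved [SPACE]-domain footprint

aMIX = aPV_32 + aPV_64                      # one mixed-precision step downstream ...
print( aMIX.dtype )                         # ... and the result is silently back in float64

Checking .dtype at every module boundary is a cheap way to catch such silent float64 replicas.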
Auto-vectorising transformers are nothing less than a Nobel-prize target. Some tweaking can be done by smart and handy designers. Do not expect any low-hanging fruit in this domain for any operations more complex than a trivial broadcasting or some easy-to-design numpy striding tricks ( a small striding sketch is shown below ).
Professional code-transformation packages are expensive ( someone has to pay for the expertise collected over many [man*years] ) and typically can justify their ROI only if deployed at scale.
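For completeness, one of the easy-to-design numpy striding tricks mentioned above, sketched here on a rolling mean; the names are mine and the result is a zero-copy view over the same buffer, so treat it as read-only:

import numpy as np
from numpy.lib.stride_tricks import as_strided

aPV = np.arange( 10, dtype = np.float64 )
N   = 4                                               # rolling window length

# a zero-copy view: each row is one window over the very same buffer
aWINDOWS = as_strided( aPV,
                       shape   = ( aPV.size - N + 1, N ),
                       strides = ( aPV.strides[0], aPV.strides[0] ) )

print( aWINDOWS.mean( axis = 1 ) )                    # all rolling means in one broadcast shot

Newer numpy versions expose the same idea more safely as numpy.lib.stride_tricks.sliding_window_view().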
You are happy not to have to design your code to run in a true-[PARALLEL], but a "just"-[CONCURRENT] fashion. If someone says "parallel", check whether the system indeed requires meeting all the conditions of a true-[PARALLEL] process-scheduling; in the majority of cases a "just"-[CONCURRENT] scheduling is just what the speaker asked for ( details go beyond the scope of this post ). Python's GIL-stepped locking prevents any but a sub-process-based split of the workflow, and that at a cost, so harvest the independence of your processing, as this will reward your intentions without paying any penalty of additional overhead add-on costs, which would hit you if you went against the rules of the overhead-strict Amdahl's Law.
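If the per-window work is indeed independent, a sub-process based split is the canonical way around the GIL; a minimal sketch ( the analyse_one_window stand-in and the window slicing are mine ), keeping in mind that pickling the data into each worker is exactly the overhead add-on cost that Amdahl's Law punishes:

import numpy as np
from multiprocessing import Pool

def analyse_one_window( aWINDOW ):                    # stand-in for the per-window steps 1-9
    return float( aWINDOW.mean() )

if __name__ == "__main__":
    aPV      = np.arange( 1000, dtype = np.float64 )
    aWINDOWS = [ aPV[i:i + 250] for i in range( 0, 750, 250 ) ]

    with Pool( processes = 3 ) as aPOOL:              # "just"-[CONCURRENT], GIL-free worker processes
        results = aPOOL.map( analyse_one_window, aWINDOWS )
    print( results )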
The actual cost/effect ratio will have to be validated on the respective versions of python and numba and on the respective CPU / cache architecture, so benchmarking is the only way to confirm any improvement ( and at what costs ).
The following example shows the saving for a trivial Exponential Moving Average function, implemented in several more or less smart manners.
def plain_EMA_fromPrice( N_period, aPriceVECTOR ):
...
@numba.jit( "float32[:]( int32, float32[:] )" )
def numba_EMA_fromPrice_float32( N_period, aPriceVECTOR ):
...
@numba.jit( "float32[:]( int32, float32[:] )", nopython = True, nogil = True )
def numba_EMA_fromPrice_float32_nopython( N_period, aPriceVECTOR ):
...
@numba.jit( "float64[:]( int64, float64[:] )" )
def numba_EMA_fromPrice_float64( N_period, aPriceVECTOR ):
...
@numba.jit( "float64[:]( int64, float64[:] )", nopython = True, nogil = True )
def numba_EMA_fromPrice_float64_nopython( N_period, aPriceVECTOR ):
...
def plain_EMA_fromPrice2( N_period, aPriceVECTOR ):
...
@numba.jit( "float32[:]( int32, float32[:] )" )
def numba_EMA_fromPrice2_float32( N_period, aPriceVECTOR ):
...
@numba.jit( "float32[:]( int32, float32[:] )", nopython = True, nogil = True )
def numba_EMA_fromPrice2_float32_nopython( N_period, aPriceVECTOR ):
...
@numba.jit( "float64[:]( int64, float64[:] )" )
def numba_EMA_fromPrice2_float64( N_period, aPriceVECTOR ):
...
@numba.jit( "float64[:]( int64, float64[:] )", nopython = True, nogil = True )
def numba_EMA_fromPrice2_float64_nopython( N_period, aPriceVECTOR ):
...
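The function bodies above are elided in the original; just for orientation, a naive recursive EMA body of the kind being benchmarked could look as follows ( this is my assumption, using the common alpha = 2 / ( N + 1 ) convention, not the author's re-factored, memory-aligned versions ):

import numpy as np
import numba

@numba.jit( "float64[:]( int64, float64[:] )", nopython = True, nogil = True )
def naive_EMA_fromPrice_float64_nopython( N_period, aPriceVECTOR ):
    # EMA[i] = alpha * price[i] + ( 1 - alpha ) * EMA[i-1]
    alpha    = 2. / ( N_period + 1 )
    anEMA    = np.empty_like( aPriceVECTOR )
    anEMA[0] = aPriceVECTOR[0]
    for i in range( 1, aPriceVECTOR.shape[0] ):
        anEMA[i] = alpha * aPriceVECTOR[i] + ( 1. - alpha ) * anEMA[i-1]
    return anEMA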
Improving performance from 710 [us] -> 160 [us] by code-re-factoring and memory-alignment, next down to ~ 12 ~ 17 [us] by numba.jit():
>>> aPV_32 = np.arange( 100, dtype = np.float32 )
>>> aPV_64 = np.arange( 100, dtype = np.float64 )
>>> aClk.start();_ = plain_EMA_fromPrice( 18, aPV_32 );aClk.stop()
715L
723L
712L
722L
975L
>>> aClk.start();_ = plain_EMA_fromPrice( 18, aPV_64 );aClk.stop()
220L
219L
216L
193L
212L
217L
218L
215L
217L
217L
>>> aClk.start();_ = numba_EMA_fromPrice_float32( 18, aPV_32 );aClk.stop()
199L
15L
16L
16L
17L
13L
16L
12L
>>> aClk.start();_ = numba_EMA_fromPrice_float64( 18, aPV_64 );aClk.stop()
170L
16L
16L
16L
18L
14L
16L
14L
17L
>>> aClk.start();_ = numba_EMA_fromPrice_float64_nopython( 18, aPV_64 );aClk.stop()
16L
17L
17L
16L
12L
16L
14L
16L
15L
>>> aClk.start();_ = plain_EMA_fromPrice2( 18, aPV_32 );aClk.stop()
648L
654L
662L
648L
647L
>>> aClk.start();_ = plain_EMA_fromPrice2( 18, aPV_64 );aClk.stop()
165L
166L
162L
162L
162L
163L
162L
162L
>>> aClk.start();_ = numba_EMA_fromPrice2_float32( 18, aPV_32 );aClk.stop()
43L
45L
43L
41L
41L
42L
>>> aClk.start();_ = numba_EMA_fromPrice2_float64( 18, aPV_64 );aClk.stop()
17L
16L
15L
17L
17L
17L
12L
>>> aClk.start();_ = numba_EMA_fromPrice2_float64_nopython( 18, aPV_64 );aClk.stop()
16L
15L
15L
14L
17L
15L
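A note on reproducing the numbers: the aClk object is not defined in the excerpt; it behaves like a microsecond stopwatch ( presumably something like zmq.Stopwatch ). A minimal stand-in is sketched below, together with the usual reminder to report the first ( warm-up ) call of any jit-ed variant separately, as some of the larger first readings above show.

import time

class Stopwatch:
    """Minimal stand-in for the aClk used above: .stop() returns the elapsed time in [us]."""
    def start( self ):
        self._t0 = time.perf_counter_ns()
    def stop( self ):
        return ( time.perf_counter_ns() - self._t0 ) // 1000

aClk = Stopwatch()
aClk.start();  _ = sum( range( 100000 ) );  print( aClk.stop(), "[us]" )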