Tags: python, machine-learning, parallel-processing, scikit-learn, algorithmic-trading

How to parallelize backtesting of a trading strategy based on a machine learning model?


For the purpose of backtesting a trading strategy based on a machine learning model, I want to run the model's retraining procedure in parallel.

Now my question is:

Is there an opportunity to improve the speed of my algorithm?

It is implemented in Python using scikit-learn. The backtesting procedure is defined as follows:

  1. Train the model on the first 900 data points ( the first three years )
  2. Make a prediction for the next day t+1
  3. Retrain the model, now including data point t+1
  4. Again make a prediction with the model for day t+2
  5. Retrain the model ....
  6. Make a prediction ...

Simply put: make a prediction for the next data point, then retrain the model on it, and repeat up to the current day ( the last data point ). For some stock predictions this can be up to 5000 data points, which means that after starting the backtest with a model trained on the first 900 data points, I need to retrain and predict 4100 times.
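In code, the walk-forward loop described above can be sketched as follows. The `fit_majority` / `predict_majority` functions are hypothetical stand-ins for the real scikit-learn estimator, used only so the snippet stays self-contained:

```python
def walk_forward(X, y, train_size, fit, predict):
    """Walk-forward backtest: at each step t, fit on the first t points
    and predict the label of point t, then advance by one day."""
    predictions = []
    for t in range(train_size, len(X)):
        model = fit(X[:t], y[:t])                 # retrain on all data seen so far
        predictions.append(predict(model, X[t]))  # predict the next day
    return predictions

# Trivial stand-in for a real scikit-learn estimator: always predicts
# the most frequent label seen so far (illustration only).
def fit_majority(X, y):
    return max(set(y), key=y.count)

def predict_majority(model, x):
    return model

X = list(range(10))                  # 10 "days" of dummy features
y = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]  # dummy labels
preds = walk_forward(X, y, train_size=5, fit=fit_majority, predict=predict_majority)
print(len(preds))   # one prediction per remaining day -> 5
```

Note that each step retrains from scratch on the full history `X[:t]`, which is exactly why the cost grows so quickly with the number of days.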

To parallelize the procedure I am using multiprocessing. I have access to a server that provides 40 CPU cores. So what I am doing is:

  1. Divide the 4100 data points into 40 parts
  2. Start a process on every core, each running the procedure for one part
  3. After a process finishes, write its result to disk
  4. Collect all results and put them together

Python Code:

from multiprocessing import Pool

pool = Pool()                # defaults to one worker process per CPU core
results = []

# consecutive boundaries define one part each: (parts[0], parts[1]), ...
for start, end in zip(parts, parts[1:]):
    results.append(pool.apply_async(classify_parts, [start, end]))

pool.close()                 # no further tasks will be submitted
pool.join()                  # wait for all worker processes to finish

for result in results:
    result.get()             # re-raises any exception from the workers

The method classify_parts starts the procedure for the given range. For instance, if I have 500 data points and start the whole backtest by training the model on the first 100 data points, 400 data points ( days ) are left for the backtest, which means:

  1. divide the 400 data points into 40 parts: [10, 20, 30, ..., 380, 390, 400]
  2. start a process on every core:
    • classify_parts( 10, 20 ), ... , classify_parts( 390, 400 )
  3. collect the results from disk and put them together
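The splitting step can be sketched like this; the exact boundary semantics are an assumption here ( half-open ranges, with an explicit leading boundary, so that every day is covered exactly once ):

```python
def make_parts(start, stop, n_parts):
    """Split the index range [start, stop) into boundary markers,
    e.g. days 100..500 in 40 parts -> [100, 110, 120, ..., 500]."""
    step = (stop - start) // n_parts
    return [start + i * step for i in range(n_parts)] + [stop]

parts = make_parts(100, 500, 40)
print(parts[:3], parts[-1])          # [100, 110, 120] 500

# consecutive pairs define the work unit handed to each core:
jobs = list(zip(parts, parts[1:]))   # (100, 110), (110, 120), ..., (490, 500)
print(len(jobs))                     # 40
```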

Hopefully I have illustrated my concept clearly.

So my biggest question is: is there a more efficient way of backtesting a machine learning model that is retrained on every new data point ( day )? With this concept, one backtesting run over 5000 data points takes more than 10 minutes.

Maybe incremental / online learning is the way to go here?
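Incremental learning would indeed avoid retraining from scratch: scikit-learn estimators such as SGDClassifier expose partial_fit(), which updates the model with just the new samples. The idea is sketched here with a plain online perceptron instead of scikit-learn, so the snippet stays self-contained; the data and numbers are illustrative only:

```python
class OnlinePerceptron:
    """Minimal online learner: one partial_fit-style update per sample."""
    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        s = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if s >= 0 else -1

    def partial_fit(self, x, y):
        # standard perceptron rule: only update the weights on a mistake
        if self.predict(x) != y:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * y

# walk-forward with O(1) updates per day instead of full retraining:
data = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], -1),
        ([0.1, 0.9], -1), ([1.0, 0.2], 1), ([0.2, 1.0], -1)]
model = OnlinePerceptron(n_features=2)
hits = 0
for x, y in data:
    hits += (model.predict(x) == y)   # predict the next point first...
    model.partial_fit(x, y)           # ...then update on it, O(1) per day
```

With 4100 backtest days, replacing 4100 full refits by 4100 constant-time updates is typically a far bigger win than any parallelization of the full-refit scheme.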


Solution

  • First of all, this is not parallel process-scheduling.

    If in doubt, one may revisit the theory of SERIAL, PARALLEL and "just"-CONCURRENT process-scheduling before reading further.

    Put in plain English

    Things may go into a SERIAL mode of operations by principle ( by intent ) or by some sort of resource limitation. If one types the word "parallel" on a keyboard, the letters p, then a, then r ... etc. have to be pressed SERIAL-ly, one after another, otherwise the input process fails to produce the guaranteed, correct result - i.e. the word "parallel" on screen ( let's assume we can forget about the Shift key for the clarity of this example ).

    Anyone may test what gets produced if one tries to press all the keys at the same moment, or in pairs, or while a cat is standing on the keyboard ... all additional ideas are welcome for hands-on experimentation.

    SERIAL processes cannot go faster

    They can only be accelerated by getting faster processing resources and/or by lowering inter-process transport delays ( or in case a new, smarter algorithm exists, is known to the programmer, and additionally proves feasible to implement on the resources available ).


    What is indeed a PARALLEL process?

    On the other hand, some things must be organised so as to happen at the opposite end of the spectrum -- all things simply have to start + happen + finish at the same time, otherwise the results of the process are considered wrong. What could serve as an example?

    • imagine a few visually "readable" examples - the show of the Red Arrows, a group of 9 jet planes which performs formation figures and manoeuvres of high-precision pilotage


      or the Frecce Tricolori


      where all planes move at the same time in exactly the given ( same ) way, otherwise the lovely show would fail.

    • imagine a concert that has to take place. This is the ultimate PARALLEL process, and one would immediately recognise how devastating it would be if it were not


    • imagine a robot-control system for several degrees of freedom, where all motion has to be carried out and kept under the control of a closed-loop [ actor - detector - actuator ] subsystem, in sync and coordinated across all of the program-controlled axes. The system must control all the NC-axes together, as a PARALLEL process -- otherwise the robot arm will never be able to follow the defined, curved, spatio-temporal trajectory in 4-D space ( best viewed in real time ).


    The robot-control example shows that the farther we move from the aerobatics or symphony examples, for which we have some direct experience ( or for which one's imagination can provide some comprehension ), the harder PARALLEL process-scheduling may be to sense or to realise. The symphony nevertheless seems the best one to recall when questioning whether a problem is indeed truly PARALLEL or not.

    Just ask yourself what would happen if only half of the ensemble arrived in time to start playing at 19:00 sharp, or if some of the violins played at an accelerated speed ... or if each player played just one note, in sequence, then let his/her left-side neighbour play theirs, waiting until the order of playing came back around before performing his/her next note of the concert. Correct - that last case would be pure-SERIAL scheduling.


    So, would 40 CPU cores allow my problem to be accelerated somehow?

    That still depends on the nature of the processing.


    If the learning-method ( undisclosed in the MCVE ) has a built-in dependence - for example, if classify_parts( 20, 30 ) needs some input or a piece of information generated in the "previous" step classify_parts( 10, 20 ) - then the answer is no.

    If the learning-method is independent, in the above-defined sense, then there is a chance to schedule the process as "just"-CONCURRENT and to benefit from the fact that there are more "free" lanes on the highway, allowing cars to run in more than one lane and to overtake slower ones on the way from start to finish. A rugby team can move from city A to city B "faster" if there are 2, 3 or more 4-seat cars available at the city A stadium than if there is just one car that has to drive back and forth to move all the players to city B.

    Next, even if there are cars enough to move all players to city B, the duration would be uncomfortably long if there were just a one-lane road from A to B, heavy traffic in the opposite direction from B to A ( not allowing overtaking ), and a single tractor, also going from A to B, but at a speed of 1 mph.
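    Back to the concrete case: whether classify_parts() is independent in this sense is easy to check. In a walk-forward backtest, each step trains only on the historical data data[:t], so no part needs the output of any other part. A minimal sketch ( the body of classify_parts is a hypothetical stand-in for the undisclosed code ):

```python
def classify_parts(data, start, end):
    """Backtest days [start, end). Each day t uses only data[:t] as its
    training history - no result from any other part is needed, so the
    parts can be scheduled concurrently in any order."""
    out = []
    for t in range(start, end):
        history = data[:t]              # read-only view of the past
        prediction = len(history) % 2   # stand-in for fit + predict
        out.append(prediction)
    return out

data = list(range(40))
# computing the parts in any order gives the same combined result:
serial = classify_parts(data, 10, 40)
chunks = [classify_parts(data, s, s + 10) for s in (30, 20, 10)]
combined = [p for chunk in reversed(chunks) for p in chunk]
print(combined == serial)   # True
```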

    Warning: ( if in search of really FAST PROCESSING, the more so of the FASTEST POSSIBLE )

    Still, even if none of the above-sketched "external" obstacles are present, one has to be very careful not to lose the benefits available from a "bigger" machine + "wider" highways ( more resources available ) for "just"-CONCURRENT process-scheduling.

    Yes, the problem here is related to the programming tools.

    Python can help a lot with fast prototyping, but beware of the Global Interpreter Lock ( GIL ) and its code-execution implications.

    In your case, the multiprocessing.Pool() shown above spawns separate worker processes, each with its own interpreter and its own GIL, so GIL-contention is not what blocks this code. Beware, however, of the thread-based look-alikes: a ThreadPool() ( multiprocessing.dummy.Pool ) of 40 python-threads might look attractive, but all the CPU-bound work in it would still wait for GIL-acquisition/release and WOULD turn out to be SERIAL again, just due to the known GIL-mechanics on python-threads.

    With processes, the costs to watch are different: every task's arguments and results get pickled and transported between processes, shared data-structures are a problem to watch and to avoid, and the Pool()'s naively-replicated environments may grow beyond one's system RAM capacities, whereupon the swapping mechanism renders such an attempt to accelerate the process totally unusable. Share nothing, otherwise the dream to accelerate processing will lose on sharing-mechanics and you are back at square No.1.

    The best way is to design the code as independent processes that share nothing and use non-blocking ( MEM, IO ) operations.
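    Such a zero-sharing design can be sketched with a process pool in which each worker receives only its own bounds and returns only its own result ( the worker body and all names here are hypothetical stand-ins for the real per-part backtest ):

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def work(bounds):
    """Each worker gets only its own (start, end) bounds and returns only
    its own result - zero sharing between the processes."""
    start, end = bounds
    return [t * t for t in range(start, end)]   # stand-in for the per-part backtest

def run(n=40, n_parts=4):
    step = n // n_parts
    jobs = [(i * step, (i + 1) * step) for i in range(n_parts)]
    # the "fork" start method keeps this sketch simple on Unix; adjust on Windows
    ctx = mp.get_context("fork")
    with ProcessPoolExecutor(max_workers=n_parts, mp_context=ctx) as pool:
        chunks = pool.map(work, jobs)    # results come back in submission order
    return [x for chunk in chunks for x in chunk]

if __name__ == "__main__":
    print(run()[:4])   # [0, 1, 4, 9]
```

    Because pool.map() preserves the submission order, the per-part results can simply be concatenated in the parent process - no intermediate files on disk and no shared state are needed.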

    Given all of the above, the 40 CPU cores may indeed help to accelerate such a "just"-CONCURRENT process.