Tags: python, scikit-learn, time-series, quantitative-finance

How to auto-discover lags in time-series data and classify using time-series data in scikit-learn


I currently have a giant array of time-series data for multiple securities and economic statistics.

I've already written a function to classify the data using scikit-learn, but it only uses non-lagged time-series data.

Is there a way, in Python with scikit-learn, to automatically lag all of these time series and find which ones (if any) tend to lag other data?

I'm working on creating a model that uses historical data to predict future performance.


Solution

  • TL;DR, but with a few QF gems put in between the lines for those who care

    Prologue:

    "There is no dinner for free" so we will have to pay a cost for having the wished result, but you know it is pretty worth doing it, so, get advanced creativity, numpy utilities knowledge and scikit-learn tools ready and turn the volume button of imagination on max.

    Next, do not expect the process to deliver results in just a few seconds.

    In my experience with AI/ML hyperparameter-search job-schedules on example DataSETs of just about X.shape ~ ( 400, 200000 ), cross-validating the best-generalising ML-engine hyperparametrisation regularly takes several days on a distributed multi-processing cluster.

    As a bonus for directing further Quant research efforts:

    a sample from similar feature-engineering research, with LDF()/GDF() indicators of the varying predictive power of the respective features elaborated into an extendedDataSET:
    as shown below, one may realise that
    the top-ranked feature alone is responsible for 43% per se,
    the next 27 features account for a further +17%,
    and the "rest" of 360+ features added the remaining ~40% into the decisions, as the importances report shows
    (
    individual feature and pre-lagging detail$ are not publi$hed here for obviou$ rea$on$
    and are free to be discussed separately
    )

    |>>> aFeatureImportancesMAP_v4( loc_PREDICTOR, X_v412 )
    
     ID.|LDF( fI ) | GDF|HUMAN_READABLE_FEATURE_NAME[COL] min()     | MAX()    | var()
     ___|__________|____|___________________________[___]___________|__________|____________
        |          |    |                           [   ]           |          |
      0. 0.4360231 | 43%| __________xxxxxxxxxxxxxCE [216] min:  ... | MAX: ... | var():  ...
      1. 0.0464851 | 48%| __________xxxxxxxxxxxxx_0 [215] min:  ... | MAX: ... | var():  ...
      2. 0.0104704 | 49%| __________xxxxxxxxxxxxx_1 [251] min:  ... | MAX: ... | var():  ...
      3. 0.0061596 | 49%| __________xxxxxxxxxxxxx_3 [206] min:  ... | MAX: ... | var():  ...
      4. 0.0055069 | 50%| __________xxxxxxxxxxxxx_2 [203] min:  ... | MAX: ... | var():  ...
      5. 0.0053235 | 50%| __________xxxxxxxxxxxxx_3 [212] min:  ... | MAX: ... | var():  ...
      6. 0.0050404 | 51%| ________f_xxxxxxxxxxxxx_7 [261] min:  ... | MAX: ... | var():  ...
      7. 0.0049998 | 52%| ________f_xxxxxxxxxxxxx_7 [253] min:  ... | MAX: ... | var():  ...
      8. 0.0048721 | 52%| __________xxxxxxxxxxxxx_4 [113] min:  ... | MAX: ... | var():  ...
      9. 0.0047981 | 52%| __________xxxxxxxxxxxxx_4 [141] min:  ... | MAX: ... | var():  ...
     10. 0.0043784 | 53%| __________xxxxxxxxxxxxx_3 [142] min:  ... | MAX: ... | var():  ...
     11. 0.0043257 | 53%| __________xxxxxxxxxxxxx_4 [129] min:  ... | MAX: ... | var():  ...
     12. 0.0042124 | 54%| __________xxxxxxxxxxxxx_1 [144] min:  ... | MAX: ... | var():  ...
     13. 0.0041864 | 54%| ________f_xxxxxxxxxxxxx_8 [260] min:  ... | MAX: ... | var():  ...
     14. 0.0039645 | 55%| __________xxxxxxxxxxxxx_1 [140] min:  ... | MAX: ... | var():  ...
     15. 0.0037486 | 55%| ________f_xxxxxxxxxxxxx_8 [252] min:  ... | MAX: ... | var():  ...
     16. 0.0036820 | 55%| ________f_xxxxxxxxxxxxx_8 [268] min:  ... | MAX: ... | var():  ...
     17. 0.0036384 | 56%| __________xxxxxxxxxxxxx_1 [108] min:  ... | MAX: ... | var():  ...
     18. 0.0036112 | 56%| __________xxxxxxxxxxxxx_2 [207] min:  ... | MAX: ... | var():  ...
     19. 0.0035978 | 56%| __________xxxxxxxxxxxxx_1 [132] min:  ... | MAX: ... | var():  ...
     20. 0.0035812 | 57%| __________xxxxxxxxxxxxx_4 [248] min:  ... | MAX: ... | var():  ...
     21. 0.0035558 | 57%| __________xxxxxxxxxxxxx_3 [130] min:  ... | MAX: ... | var():  ...
     22. 0.0035105 | 57%| _______f_Kxxxxxxxxxxxxx_1 [283] min:  ... | MAX: ... | var():  ...
     23. 0.0034851 | 58%| __________xxxxxxxxxxxxx_4 [161] min:  ... | MAX: ... | var():  ...
     24. 0.0034352 | 58%| __________xxxxxxxxxxxxx_2 [250] min:  ... | MAX: ... | var():  ...
     25. 0.0034146 | 59%| __________xxxxxxxxxxxxx_2 [199] min:  ... | MAX: ... | var():  ...
     26. 0.0033744 | 59%| __________xxxxxxxxxxxxx_1 [ 86] min:  ... | MAX: ... | var():  ...
     27. 0.0033624 | 59%| __________xxxxxxxxxxxxx_3 [202] min:  ... | MAX: ... | var():  ...
     28. 0.0032876 | 60%| __________xxxxxxxxxxxxx_4 [169] min:  ... | MAX: ... | var():  ...
     ...
     62. 0.0027483 | 70%| __________xxxxxxxxxxxxx_8 [117] min:  ... | MAX: ... | var():  ...
     63. 0.0027368 | 70%| __________xxxxxxxxxxxxx_2 [ 85] min:  ... | MAX: ... | var():  ...
     64. 0.0027221 | 70%| __________xxxxxxxxxxxxx_1 [211] min:  ... | MAX: ... | var():  ...
     ...
    104. 0.0019674 | 80%| ________f_xxxxxxxxxxxxx_3 [273] min:  ... | MAX: ... | var():  ...
    105. 0.0019597 | 80%| __________xxxxxxxxxxxxx_6 [ 99] min:  ... | MAX: ... | var():  ...
    106. 0.0019199 | 80%| __________xxxxxxxxxxxxx_1 [104] min:  ... | MAX: ... | var():  ...
     ...
    169. 0.0012095 | 90%| __________xxxxxxxxxxxxx_4 [181] min:  ... | MAX: ... | var():  ...
    170. 0.0012017 | 90%| __________xxxxxxxxxxxxx_3 [  9] min:  ... | MAX: ... | var():  ...
    171. 0.0011984 | 90%| __________xxxxxxxxxxxxx_4 [185] min:  ... | MAX: ... | var():  ...
    172. 0.0011926 | 90%| __________xxxxxxxxxxxxx_1 [ 19] min:  ... | MAX: ... | var():  ...
     ...
    272. 0.0005956 | 99%| __________xxxxxxxxxxxxx_2 [ 33] min:  ... | MAX: ... | var():  ...
    273. 0.0005844 | 99%| __________xxxxxxxxxxxxx_2 [127] min:  ... | MAX: ... | var():  ...
    274. 0.0005802 | 99%| __________xxxxxxxxxxxxx_3 [ 54] min:  ... | MAX: ... | var():  ...
    275. 0.0005663 | 99%| __________xxxxxxxxxxxxx_3 [ 32] min:  ... | MAX: ... | var():  ...
    276. 0.0005534 | 99%| __________xxxxxxxxxxxxx_1 [ 83] min:  ... | MAX: ... | var():  ...
     ...
    391. 0.0004347 |100%| __________xxxxxxxxxxxxx_2 [ 82] min:  ... | MAX: ... | var():  ...
    

    So rather plan & reserve a bit more vCPU-core capacity
    than expect to run such a search on a laptop during a forthcoming lunchtime...
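
    The LDF()/GDF() tooling above is not published, but a similar cumulative-importance listing can be sketched from scikit-learn's standard feature_importances_ attribute. The helper name report_feature_importances(), the ExtraTreesClassifier choice and the X_ext / y names below are illustrative assumptions of this sketch, not the original aFeatureImportancesMAP_v4():

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier

    def report_feature_importances( aFittedLEARNER, X ):
        # a hypothetical stand-in for aFeatureImportancesMAP_v4():
        # prints per-feature importance ( ~ the LDF column ) and the
        # running cumulative share ( ~ the GDF column ), strongest first
        fI     = aFittedLEARNER.feature_importances_
        order  = np.argsort( fI )[::-1]                  # descending by importance
        cumGDF = np.cumsum( fI[order] ) / fI.sum()       # cumulative share of total importance
        for rank, col in enumerate( order ):
            print( "%4d. %.7f |%3.0f%%| feature[%3d] min: %g | MAX: %g | var(): %g"
                   % ( rank, fI[col], 100 * cumGDF[rank], col,
                       X[:, col].min(), X[:, col].max(), X[:, col].var() )
                   )

    # usage sketch:
    # aLEARNER = ExtraTreesClassifier( n_estimators = 500 ).fit( X_ext, y )
    # report_feature_importances( aLEARNER, X_ext )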


    Let's have a working plan

    The intended auto-discovery service is, for many reasons, not a part of scikit-learn; however, the goal is achievable.

    We will use the following adaptation steps to make it work:

    1. we'll rely on scikit-learn's ability to search for the best tandem of [learner + hyperparameters] for a well-defined AI/ML problem

    2. we'll rely on numpy's powers, for obvious reasons, to support the scikit-learn phase

    3. we'll rely on proper scikit-learn AI/ML-engine processing & process controls ( Pipeline, GridSearchCV et al ), which are far better optimised at the low level for such massive-scale attacks, rather than on "externally" orchestrated for-looping ( which loses all the valuable cache/data-locality advantages and is known to carry a remarkable performance disadvantage )

    4. we'll substitute the wished-for auto-discovery with a fast, one-step, a-priori DataSET adaptation

    5. we'll let scikit-learn decide ( quantitatively indicate ) which of the pre-lagged features, artificially synthesised into the compound DataSET elaborated in [4], finally have the best predictive powers ( a minimal Pipeline / GridSearchCV skeleton for these steps is sketched right after this list )
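
    As a minimal skeleton of steps 1., 3. and 5., assuming a recent scikit-learn ( sklearn.model_selection et al ); the estimator choice, the parameter ranges and the X_ext / y names are illustrative assumptions only, not a recommendation:

    from sklearn.pipeline          import Pipeline
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.ensemble          import RandomForestClassifier
    from sklearn.model_selection   import GridSearchCV, TimeSeriesSplit

    # X_ext ~ the pre-lagged extendedDataSET from step [4], y ~ the class labels
    aPIPELINE = Pipeline( [ ( "select", SelectKBest( score_func = f_classif ) ),
                            ( "learn",  RandomForestClassifier() ),
                            ]
                          )

    aPARAM_GRID = { "select__k":           [ 25, 50, 100 ],   # how many pre-lagged features to keep
                    "learn__n_estimators": [ 200, 500 ],
                    "learn__max_depth":    [ 3, 5, None ],
                    }

    aSEARCH = GridSearchCV( aPIPELINE,
                            aPARAM_GRID,
                            cv     = TimeSeriesSplit( n_splits = 5 ),   # keeps the time-ordering intact
                            n_jobs = -1                                 # use all local cores
                            )
    # aSEARCH.fit( X_ext, y )
    # aSEARCH.best_params_ then indicates which "select__k" / learner settings generalised best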


    [4] DataSET adaptation with smart numpy aids:

    Your DataSET consists of an unspecified number of TimeSeries. For each of them, you assume that some pre-lagged version may have better predictive power, and you would like to find it ( i.e. quantitatively support the selection of such lags for the final ML-predictor ).

    So let's first construct, for each TimeSeries DataSET[i,:] in the source part of the DataSET, an extended part of the DataSET which contains the respective pre-lagged versions of that TimeSeries:

    >>> import numpy as np                      # missing import added
    >>>
    >>> def generate_TimeSERIE_preLAGs( aTimeSERIE, pre_lag_window_depth ):
    ...     #
    ...     # COURTESY & THANKS TO:
    ...     #                     Nicolas P. Rougier, INRIA
    ...     #             Author: Joe Kington / Erik Rigtorp
    ...     #
    ...     # builds a sliding-window view: row i holds
    ...     # [ aTimeSERIE[i], ..., aTimeSERIE[i + pre_lag_window_depth - 1] ]
    ...     # NOTE: assumes a contiguous 1-D array ( so itemsize == stride )
    ...     shape   = ( aTimeSERIE.size - pre_lag_window_depth + 1,
    ...                 pre_lag_window_depth
    ...                 )
    ...     strides = ( aTimeSERIE.itemsize,
    ...                 aTimeSERIE.itemsize
    ...                 )
    ...     return np.lib.stride_tricks.as_strided( aTimeSERIE,
    ...                                             shape,
    ...                                             strides = strides
    ...                                             )
    ...
    >>> xVECTOR = np.arange( 10 )
    >>>
    >>> pre_laggz_on_xVECTOR = generate_TimeSERIE_preLAGs( xVECTOR, 4 )
    >>>
    >>> pre_laggz_on_xVECTOR
    array([[0, 1, 2, 3],
           [1, 2, 3, 4],
           [2, 3, 4, 5],
           [3, 4, 5, 6],
           [4, 5, 6, 7],
           [5, 6, 7, 8],
           [6, 7, 8, 9]])
    >>>
    

    With such an extended ( wider, and you know by how much ) but static extendedDataSET, now containing both the original TimeSERIE vector and all the pre-lagged versions thereof that you wish to test, your ML-search starts.
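
    One possible way to assemble such an extendedDataSET from several source TimeSeries, re-using generate_TimeSERIE_preLAGs() from above, is sketched below; the helper name build_extendedDataSET(), the equal-length assumption and the label alignment are assumptions of this sketch, not the only option:

    import numpy as np

    def build_extendedDataSET( aListOfTimeSERIEs, pre_lag_window_depth ):
        # stacks the pre-lagged windows of each TimeSeries side by side,
        # re-using generate_TimeSERIE_preLAGs() defined above;
        # rows ~ time-aligned observations, columns ~ all lags of all series
        # ( assumes every TimeSeries has the same length )
        blocks = [ generate_TimeSERIE_preLAGs( np.ascontiguousarray( aTimeSERIE ),
                                               pre_lag_window_depth
                                               )
                   for aTimeSERIE in aListOfTimeSERIEs
                   ]
        return np.hstack( blocks )    # shape: ( N - depth + 1, depth * number_of_series )

    # usage sketch ( priceSERIE, volumeSERIE, targetLABELs are placeholders ):
    # X_ext = build_extendedDataSET( [ priceSERIE, volumeSERIE ], 4 )
    # y     = targetLABELs[ 4 - 1 : ]   # align each label with the last element of its window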

    Phase [1.A]
    Initially use scikit-learn tools for the best feature-selection supporting your hypothesis
    +
    Phase [1.B]
    next, start hyperparameter optimisation for the best cross-validation results, supporting a maximum generalisation ability of the learner.

    Phase [1.B] naturally ought to run on a sub-set of the extendedDataSET ( which was intentionally extended for the sake of the scikit-learn feature-selection evaluation in phase [1.A] ).
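
    A minimal sketch of the two phases, assuming the X_ext / y names from the assembly sketch above and a recent scikit-learn; the ExtraTreesClassifier, the "median" threshold and the parameter grid are illustrative assumptions only:

    from sklearn.ensemble          import ExtraTreesClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection   import GridSearchCV, TimeSeriesSplit

    # Phase [1.A]: let a tree-ensemble rank all pre-lagged columns of the extendedDataSET
    aRANKER   = ExtraTreesClassifier( n_estimators = 500 ).fit( X_ext, y )
    aSELECTOR = SelectFromModel( aRANKER, prefit = True, threshold = "median" )
    X_reduced = aSELECTOR.transform( X_ext )             # keep the stronger half of the features

    # Phase [1.B]: hyperparameter search, restricted to the reduced sub-set only
    aSEARCH = GridSearchCV( ExtraTreesClassifier(),
                            { "n_estimators": [ 200, 500, 1000 ],
                              "max_features": [ "sqrt", 0.25, 0.50 ],
                              },
                            cv     = TimeSeriesSplit( n_splits = 5 ),
                            n_jobs = -1
                            ).fit( X_reduced, y )

    TimeSeriesSplit is used here only to keep the folds time-ordered and avoid look-ahead leakage; any other leakage-safe splitter would serve the same purpose.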


    Epilogue:
    Memento mori: Quants do not like this ... but

    For the sake of your further interests in TimeSeries analyses & quantitative modelling, you might be interested in the best answer on this >>>


    Correlation does not imply Causation, so even more care is needed when making decisions that will actually be executed ( paper can always handle much more than the markets can :o) ).