I currently have a giant array holding time-series data of multiple securities and economic statistics.
I've already written a function to classify the data using scikit-learn, but the function only uses non-lagged time-series data.
Is there a way, in Python, using scikit-learn, to automatically lag all of these time-series to find which time-series (if any) tend to lag other data?
I'm working on creating a model using historic data to predict future performance.
TL;DR
but with a few QF gems put in between the lines for those who care

"There is no free lunch", so we will have to pay some cost for getting the wished-for result, but it is well worth doing: get advanced creativity, numpy utilities knowledge and scikit-learn tools ready, and turn the volume button of imagination up to max.
Next, do not expect the process to deliver results in just a few seconds. From experience with AI/ML hyperparameter-search job-schedules on example DataSETs of just about X.shape ~ ( 400, 200000 ), the cross-validation for the best-generalising ML-engine hyperparametrisation regularly takes several days on a distributed multi-processing cluster.
As a bonus for directing further Quant research efforts: a sample from a similar feature-engineering research, with LDF()/GDF() indicators of the varying predictive power of the respective features, elaborated into an extendedDataSET.

As the importances report below shows, one may realise that just the top feature is responsible for 43% per se, the next 27 features account for a further +17%, and the "rest" of the 360+ features add the remaining ~40% to the decisions
( individual feature and pre-lagging detail$ are not publi$hed here for obviou$ rea$on$ and are free to be discussed separately ):
>>> aFeatureImportancesMAP_v4( loc_PREDICTOR, X_v412 )
ID.|LDF( fI ) | GDF|HUMAN_READABLE_FEATURE_NAME[COL] min() | MAX() | var()
___|__________|____|___________________________[___]___________|__________|____________
| | | [ ] | |
0. 0.4360231 | 43%| __________xxxxxxxxxxxxxCE [216] min: ... | MAX: ... | var(): ...
1. 0.0464851 | 48%| __________xxxxxxxxxxxxx_0 [215] min: ... | MAX: ... | var(): ...
2. 0.0104704 | 49%| __________xxxxxxxxxxxxx_1 [251] min: ... | MAX: ... | var(): ...
3. 0.0061596 | 49%| __________xxxxxxxxxxxxx_3 [206] min: ... | MAX: ... | var(): ...
4. 0.0055069 | 50%| __________xxxxxxxxxxxxx_2 [203] min: ... | MAX: ... | var(): ...
5. 0.0053235 | 50%| __________xxxxxxxxxxxxx_3 [212] min: ... | MAX: ... | var(): ...
6. 0.0050404 | 51%| ________f_xxxxxxxxxxxxx_7 [261] min: ... | MAX: ... | var(): ...
7. 0.0049998 | 52%| ________f_xxxxxxxxxxxxx_7 [253] min: ... | MAX: ... | var(): ...
8. 0.0048721 | 52%| __________xxxxxxxxxxxxx_4 [113] min: ... | MAX: ... | var(): ...
9. 0.0047981 | 52%| __________xxxxxxxxxxxxx_4 [141] min: ... | MAX: ... | var(): ...
10. 0.0043784 | 53%| __________xxxxxxxxxxxxx_3 [142] min: ... | MAX: ... | var(): ...
11. 0.0043257 | 53%| __________xxxxxxxxxxxxx_4 [129] min: ... | MAX: ... | var(): ...
12. 0.0042124 | 54%| __________xxxxxxxxxxxxx_1 [144] min: ... | MAX: ... | var(): ...
13. 0.0041864 | 54%| ________f_xxxxxxxxxxxxx_8 [260] min: ... | MAX: ... | var(): ...
14. 0.0039645 | 55%| __________xxxxxxxxxxxxx_1 [140] min: ... | MAX: ... | var(): ...
15. 0.0037486 | 55%| ________f_xxxxxxxxxxxxx_8 [252] min: ... | MAX: ... | var(): ...
16. 0.0036820 | 55%| ________f_xxxxxxxxxxxxx_8 [268] min: ... | MAX: ... | var(): ...
17. 0.0036384 | 56%| __________xxxxxxxxxxxxx_1 [108] min: ... | MAX: ... | var(): ...
18. 0.0036112 | 56%| __________xxxxxxxxxxxxx_2 [207] min: ... | MAX: ... | var(): ...
19. 0.0035978 | 56%| __________xxxxxxxxxxxxx_1 [132] min: ... | MAX: ... | var(): ...
20. 0.0035812 | 57%| __________xxxxxxxxxxxxx_4 [248] min: ... | MAX: ... | var(): ...
21. 0.0035558 | 57%| __________xxxxxxxxxxxxx_3 [130] min: ... | MAX: ... | var(): ...
22. 0.0035105 | 57%| _______f_Kxxxxxxxxxxxxx_1 [283] min: ... | MAX: ... | var(): ...
23. 0.0034851 | 58%| __________xxxxxxxxxxxxx_4 [161] min: ... | MAX: ... | var(): ...
24. 0.0034352 | 58%| __________xxxxxxxxxxxxx_2 [250] min: ... | MAX: ... | var(): ...
25. 0.0034146 | 59%| __________xxxxxxxxxxxxx_2 [199] min: ... | MAX: ... | var(): ...
26. 0.0033744 | 59%| __________xxxxxxxxxxxxx_1 [ 86] min: ... | MAX: ... | var(): ...
27. 0.0033624 | 59%| __________xxxxxxxxxxxxx_3 [202] min: ... | MAX: ... | var(): ...
28. 0.0032876 | 60%| __________xxxxxxxxxxxxx_4 [169] min: ... | MAX: ... | var(): ...
...
62. 0.0027483 | 70%| __________xxxxxxxxxxxxx_8 [117] min: ... | MAX: ... | var(): ...
63. 0.0027368 | 70%| __________xxxxxxxxxxxxx_2 [ 85] min: ... | MAX: ... | var(): ...
64. 0.0027221 | 70%| __________xxxxxxxxxxxxx_1 [211] min: ... | MAX: ... | var(): ...
...
104. 0.0019674 | 80%| ________f_xxxxxxxxxxxxx_3 [273] min: ... | MAX: ... | var(): ...
105. 0.0019597 | 80%| __________xxxxxxxxxxxxx_6 [ 99] min: ... | MAX: ... | var(): ...
106. 0.0019199 | 80%| __________xxxxxxxxxxxxx_1 [104] min: ... | MAX: ... | var(): ...
...
169. 0.0012095 | 90%| __________xxxxxxxxxxxxx_4 [181] min: ... | MAX: ... | var(): ...
170. 0.0012017 | 90%| __________xxxxxxxxxxxxx_3 [ 9] min: ... | MAX: ... | var(): ...
171. 0.0011984 | 90%| __________xxxxxxxxxxxxx_4 [185] min: ... | MAX: ... | var(): ...
172. 0.0011926 | 90%| __________xxxxxxxxxxxxx_1 [ 19] min: ... | MAX: ... | var(): ...
...
272. 0.0005956 | 99%| __________xxxxxxxxxxxxx_2 [ 33] min: ... | MAX: ... | var(): ...
273. 0.0005844 | 99%| __________xxxxxxxxxxxxx_2 [127] min: ... | MAX: ... | var(): ...
274. 0.0005802 | 99%| __________xxxxxxxxxxxxx_3 [ 54] min: ... | MAX: ... | var(): ...
275. 0.0005663 | 99%| __________xxxxxxxxxxxxx_3 [ 32] min: ... | MAX: ... | var(): ...
276. 0.0005534 | 99%| __________xxxxxxxxxxxxx_1 [ 83] min: ... | MAX: ... | var(): ...
...
391. 0.0004347 |100%| __________xxxxxxxxxxxxx_2 [ 82] min: ... | MAX: ... | var(): ...
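The aFeatureImportancesMAP_v4() helper itself is not published; purely for orientation, a similar LDF/GDF report ( per-feature importance plus its running cumulative share ) can be reconstructed from any fitted tree-ensemble along these lines, where the function name, the predictor and the placeholder data are illustrative only:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def aFeatureImportancesMAP_sketch( aPREDICTOR, X ):
    # aPREDICTOR: any fitted ensemble exposing .feature_importances_
    # LDF ~ the individual importance of a feature, GDF ~ the running cumulative share
    fI    = aPREDICTOR.feature_importances_
    order = np.argsort( fI )[::-1]                        # the strongest features first
    GDF   = np.cumsum( fI[order] ) / fI.sum()
    for rank, col in enumerate( order.tolist() ):
        print( "{0:3d}. {1:9.7f} |{2:3.0f}%| [{3:3d}] min: {4:8.3f} | MAX: {5:8.3f} | var(): {6:8.3f}".format(
                rank, fI[col], 100 * GDF[rank], col,
                X[:, col].min(), X[:, col].max(), X[:, col].var()
                ) )

# illustrative usage on placeholder data only:
X = np.random.randn( 400, 20 )
y = ( X[:, 0] + 0.1 * np.random.randn( 400 ) > 0 ).astype( int )
aFeatureImportancesMAP_sketch( RandomForestClassifier( n_estimators = 100, random_state = 0 ).fit( X, y ), X )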
So rather plan for & reserve a bit more vCPU-core capacity than to expect such a search to run on a laptop just during a forthcoming lunchtime...
The intended auto-find service is, due to many reasons, not a part of scikit-learn; the goal is, however, achievable.
We will use the following adaptation steps to make it work:

[1] we'll rely on scikit-learn abilities to search for the best tandem of a [ learner + hyperparameters ] for a well-defined AI/ML problem;

[2] we'll rely on numpy powers, for obvious reasons, to support the scikit-learn phase;

[3] we'll rely on the proper scikit-learn AI/ML-engine processing & process-controls ( Pipeline, GridSearchCV et al ), which are by far better optimised at the low level for such massive-scale attacks than any "externally" orchestrated for-looping ( which loses all the valuable cache/data-locality advantages and is known to carry a remarkable performance disadvantage );

[4] we'll substitute the wished-for autodiscovery by a fast, one-step, a-priori DataSET adaptation;

[5] we'll let scikit-learn decide ( quantitatively indicate ) which of the pre-lagged features, artificially synthesised into the compound DataSET elaborated in [4], finally have the best predictive powers — a minimal sketch of such a [ learner + hyperparameters ] search follows right below.
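A minimal sketch of such a tandem search, assuming a classification target y aligned row-by-row with the pre-lagged extendedDataSET ( constructed in the next section ); the estimator choice, the parameter grid and the names ( aPIPELINE, aGRIDSEARCHer, X_extended ) are illustrative placeholders only, not the original setup:

import numpy as np
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler
from sklearn.ensemble        import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

X_extended = np.random.randn( 400, 50 )          # placeholder for the real pre-lagged extendedDataSET
y          = np.random.randint( 0, 2, 400 )      # placeholder for the aligned classification labels

#------------------------------------- [1]+[3]: one well-defined [ learner + hyperparameters ] search
aPIPELINE = Pipeline( [ ( "scaler",  StandardScaler() ),
                        ( "learner", RandomForestClassifier( random_state = 0 ) ),
                        ] )

aPARAM_GRID = { "learner__n_estimators":     ( 100, 300 ),
                "learner__max_depth":        ( None, 8, 16 ),
                "learner__min_samples_leaf": ( 1, 5, 25 ),
                }

aGRIDSEARCHer = GridSearchCV( aPIPELINE,
                              aPARAM_GRID,
                              cv     = TimeSeriesSplit( n_splits = 5 ),  # keeps the temporal ordering intact
                              n_jobs = -1                                # spread the work over all available cores
                              )

aGRIDSEARCHer.fit( X_extended, y )               # the heavy lifting, left to scikit-learn internals
#                  aGRIDSEARCHer.best_estimator_ # the best-generalising tandem found

The TimeSeriesSplit cross-validator is used here on purpose: ordinary shuffled K-folds would leak future information into the past on time-series data.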
DataSET adaptation with smart numpy aids:

Your DataSET consists of an unspecified count of TimeSeries. For each of them you assume that some pre-lagged version may have better predictive powers, and you would like to find out which ( i.e. to quantitatively support the selection of such lags for the final ML-predictor ).

So let's first construct, for each TimeSerie DataSET[i,:] in the source part of the DataSET, an extended part of the DataSET which contains the respectively pre-lagged versions of that TimeSerie:
>>> import numpy as np
>>>
>>> def generate_TimeSERIE_preLAGs( aTimeSERIE, pre_lag_window_depth ):
...     #
...     # COURTESY & THANKS TO:
...     # Nicolas P. Rougier, INRIA
...     # Author: Joe Kington / Erik Rigtorp
...     #
...     # returns a zero-copy strided view, where element [i,j] == aTimeSERIE[i+j],
...     # i.e. column j holds the TimeSerie pre-lagged by j steps
...     shape   = ( aTimeSERIE.size - pre_lag_window_depth + 1,
...                 pre_lag_window_depth
...                 )
...     strides = ( aTimeSERIE.itemsize,
...                 aTimeSERIE.itemsize
...                 )
...     return np.lib.stride_tricks.as_strided( aTimeSERIE,
...                                             shape,
...                                             strides = strides
...                                             )
...
>>> xVECTOR = np.arange( 10 )
>>>
>>> pre_laggz_on_xVECTOR = generate_TimeSERIE_preLAGs( xVECTOR, 4 )
>>>
>>> pre_laggz_on_xVECTOR
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8],
[6, 7, 8, 9]])
>>>
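To synthesise the whole compound extendedDataSET per [4], the same windowing gets applied to every TimeSerie ( one per row of the source DataSET ) and the resulting pre-lagged blocks get stacked side by side. A minimal sketch, continuing the session above; build_extendedDataSET is an illustrative helper, not a part of the original toolchain, and it assumes the source DataSET has one TimeSerie per row:

>>> def build_extendedDataSET( aDataSET, pre_lag_window_depth ):
...     # stack the pre-lagged views of every TimeSerie ( one per row of aDataSET )
...     # into a single 2D matrix: rows ~ aligned time-steps, columns ~ [ series x pre-lags ]
...     return np.hstack( [ generate_TimeSERIE_preLAGs( np.ascontiguousarray( aDataSET[i,:] ),
...                                                     pre_lag_window_depth
...                                                     )
...                         for i in range( aDataSET.shape[0] )
...                         ] )
...
>>> aDataSET = np.vstack( ( xVECTOR, 10 * xVECTOR ) )     # two toy TimeSeries, one per row
>>>
>>> build_extendedDataSET( aDataSET, 4 ).shape            # 7 aligned rows, 2 series x 4 pre-lags each
(7, 8)
>>>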
With such an extended ( much wider, as you now know ) but static extendedDataSET, containing both the original TimeSERIE vectors and all the wished-to-test pre-lagged versions thereof, your ML-search starts.
Phase [1.A]: initially use the scikit-learn tools for the best feature-selection supporting your hypothesis,
+
Phase [1.B]: next start the hyperparameter optimisation, for the best cross-validation results supporting a maximum generalisation ability of the learner.

Phase [1.B] naturally ought to run on a sub-set of the extendedDataSET ( the one intentionally extended for the sake of the scikit-learn evaluation in the feature-selection phase [1.A] ). A minimal sketch of phase [1.A] follows below.
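A minimal sketch of phase [1.A], assuming the same placeholder extendedDataSET and labels as in the grid-search sketch above; SelectFromModel with the "median" threshold is just one of several scikit-learn feature-selection options, chosen here for illustration only:

import numpy as np
from sklearn.ensemble          import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_extended = np.random.randn( 400, 50 )          # placeholder for the real pre-lagged extendedDataSET
y          = np.random.randint( 0, 2, 400 )      # placeholder for the aligned classification labels

#------------------------------------- [1.A]: let the fitted importances decide which pre-lagged columns to keep
loc_SELECTOR = SelectFromModel( RandomForestClassifier( n_estimators = 300, random_state = 0 ),
                                threshold = "median"     # an illustrative cut: keep the better half
                                ).fit( X_extended, y )

kept_COLUMNS = np.flatnonzero( loc_SELECTOR.get_support() )   # indices of the surviving pre-lagged features
X_selected   = loc_SELECTOR.transform( X_extended )           # the reduced sub-set, handed over to phase [1.B]

Given the [ series x pre-lags ] column layout of the extendedDataSET, each index in kept_COLUMNS maps straight back to a ( TimeSerie, pre-lag ) pair, which is exactly the quantitative answer to the which-series-lags-which question.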
For the sake of your further interest in TimeSeries analyses & quantitative modelling, you might also be interested in the best answer on this topic >>>
Correlation does not imply Causation, so even more care is needed when making decisions that are to be executed ( paper can always handle much more than the markets can :o) ).