Tags: python, scikit-learn, time-series, quantitative-finance

How to auto-discover lags in time-series data and classify using time-series data in scikit-learn


I currently have a giant array of time-series data for multiple securities and economic statistics.

I've already written a function to classify the data using scikit-learn, but it only uses non-lagged time-series data.

Is there a way, in Python with scikit-learn, to automatically lag all of these time series and find which ones (if any) tend to lag other data?

I'm working on creating a model that uses historical data to predict future performance.


Solution

  • TL;DR, but with a few QF gems put in between the lines for those who care

    Prologue:

    "There is no dinner for free" so we will have to pay a cost for having the wished result, but you know it is pretty worth doing it, so, get advanced creativity, numpy utilities knowledge and scikit-learn tools ready and turn the volume button of imagination on max.

    Next, do not expect the process to deliver results in just a few seconds.

    In my experience with AI/ML hyperparameter-search job-schedules on example DataSETs of just about X.shape ~ ( 400, 200000 ), cross-validating the best-generalising ML-engine hyperparametrisation regularly takes several days on a distributed multi-processing cluster.

    As a bonus for directing further Quant research efforts:

    a sample from similar feature-engineering research, with LDF()/GDF() indicators of the varying predictive power of the respective features elaborated into an extendedDataSET:
    as shown below, one may realise that
    the top-ranked feature alone is responsible for 43% per se,
    the next 27 features account for a further +17%,
    and the "rest" of 360+ features added the remaining ~40% into the decisions, as the importances report shows
    (
    individual feature and pre-lagging detail$ are not publi$hed here for obviou$ rea$on$
    and are free to be discussed separately
    )

    |>>> aFeatureImportancesMAP_v4( loc_PREDICTOR, X_v412 )
    
     ID.|LDF( fI ) | GDF|HUMAN_READABLE_FEATURE_NAME[COL] min()     | MAX()    | var()
     ___|__________|____|___________________________[___]___________|__________|____________
        |          |    |                           [   ]           |          |
      0. 0.4360231 | 43%| __________xxxxxxxxxxxxxCE [216] min:  ... | MAX: ... | var():  ...
      1. 0.0464851 | 48%| __________xxxxxxxxxxxxx_0 [215] min:  ... | MAX: ... | var():  ...
      2. 0.0104704 | 49%| __________xxxxxxxxxxxxx_1 [251] min:  ... | MAX: ... | var():  ...
      3. 0.0061596 | 49%| __________xxxxxxxxxxxxx_3 [206] min:  ... | MAX: ... | var():  ...
      4. 0.0055069 | 50%| __________xxxxxxxxxxxxx_2 [203] min:  ... | MAX: ... | var():  ...
      5. 0.0053235 | 50%| __________xxxxxxxxxxxxx_3 [212] min:  ... | MAX: ... | var():  ...
      6. 0.0050404 | 51%| ________f_xxxxxxxxxxxxx_7 [261] min:  ... | MAX: ... | var():  ...
      7. 0.0049998 | 52%| ________f_xxxxxxxxxxxxx_7 [253] min:  ... | MAX: ... | var():  ...
      8. 0.0048721 | 52%| __________xxxxxxxxxxxxx_4 [113] min:  ... | MAX: ... | var():  ...
      9. 0.0047981 | 52%| __________xxxxxxxxxxxxx_4 [141] min:  ... | MAX: ... | var():  ...
     10. 0.0043784 | 53%| __________xxxxxxxxxxxxx_3 [142] min:  ... | MAX: ... | var():  ...
     11. 0.0043257 | 53%| __________xxxxxxxxxxxxx_4 [129] min:  ... | MAX: ... | var():  ...
     12. 0.0042124 | 54%| __________xxxxxxxxxxxxx_1 [144] min:  ... | MAX: ... | var():  ...
     13. 0.0041864 | 54%| ________f_xxxxxxxxxxxxx_8 [260] min:  ... | MAX: ... | var():  ...
     14. 0.0039645 | 55%| __________xxxxxxxxxxxxx_1 [140] min:  ... | MAX: ... | var():  ...
     15. 0.0037486 | 55%| ________f_xxxxxxxxxxxxx_8 [252] min:  ... | MAX: ... | var():  ...
     16. 0.0036820 | 55%| ________f_xxxxxxxxxxxxx_8 [268] min:  ... | MAX: ... | var():  ...
     17. 0.0036384 | 56%| __________xxxxxxxxxxxxx_1 [108] min:  ... | MAX: ... | var():  ...
     18. 0.0036112 | 56%| __________xxxxxxxxxxxxx_2 [207] min:  ... | MAX: ... | var():  ...
     19. 0.0035978 | 56%| __________xxxxxxxxxxxxx_1 [132] min:  ... | MAX: ... | var():  ...
     20. 0.0035812 | 57%| __________xxxxxxxxxxxxx_4 [248] min:  ... | MAX: ... | var():  ...
     21. 0.0035558 | 57%| __________xxxxxxxxxxxxx_3 [130] min:  ... | MAX: ... | var():  ...
     22. 0.0035105 | 57%| _______f_Kxxxxxxxxxxxxx_1 [283] min:  ... | MAX: ... | var():  ...
     23. 0.0034851 | 58%| __________xxxxxxxxxxxxx_4 [161] min:  ... | MAX: ... | var():  ...
     24. 0.0034352 | 58%| __________xxxxxxxxxxxxx_2 [250] min:  ... | MAX: ... | var():  ...
     25. 0.0034146 | 59%| __________xxxxxxxxxxxxx_2 [199] min:  ... | MAX: ... | var():  ...
     26. 0.0033744 | 59%| __________xxxxxxxxxxxxx_1 [ 86] min:  ... | MAX: ... | var():  ...
     27. 0.0033624 | 59%| __________xxxxxxxxxxxxx_3 [202] min:  ... | MAX: ... | var():  ...
     28. 0.0032876 | 60%| __________xxxxxxxxxxxxx_4 [169] min:  ... | MAX: ... | var():  ...
     ...
     62. 0.0027483 | 70%| __________xxxxxxxxxxxxx_8 [117] min:  ... | MAX: ... | var():  ...
     63. 0.0027368 | 70%| __________xxxxxxxxxxxxx_2 [ 85] min:  ... | MAX: ... | var():  ...
     64. 0.0027221 | 70%| __________xxxxxxxxxxxxx_1 [211] min:  ... | MAX: ... | var():  ...
     ...
    104. 0.0019674 | 80%| ________f_xxxxxxxxxxxxx_3 [273] min:  ... | MAX: ... | var():  ...
    105. 0.0019597 | 80%| __________xxxxxxxxxxxxx_6 [ 99] min:  ... | MAX: ... | var():  ...
    106. 0.0019199 | 80%| __________xxxxxxxxxxxxx_1 [104] min:  ... | MAX: ... | var():  ...
     ...
    169. 0.0012095 | 90%| __________xxxxxxxxxxxxx_4 [181] min:  ... | MAX: ... | var():  ...
    170. 0.0012017 | 90%| __________xxxxxxxxxxxxx_3 [  9] min:  ... | MAX: ... | var():  ...
    171. 0.0011984 | 90%| __________xxxxxxxxxxxxx_4 [185] min:  ... | MAX: ... | var():  ...
    172. 0.0011926 | 90%| __________xxxxxxxxxxxxx_1 [ 19] min:  ... | MAX: ... | var():  ...
     ...
    272. 0.0005956 | 99%| __________xxxxxxxxxxxxx_2 [ 33] min:  ... | MAX: ... | var():  ...
    273. 0.0005844 | 99%| __________xxxxxxxxxxxxx_2 [127] min:  ... | MAX: ... | var():  ...
    274. 0.0005802 | 99%| __________xxxxxxxxxxxxx_3 [ 54] min:  ... | MAX: ... | var():  ...
    275. 0.0005663 | 99%| __________xxxxxxxxxxxxx_3 [ 32] min:  ... | MAX: ... | var():  ...
    276. 0.0005534 | 99%| __________xxxxxxxxxxxxx_1 [ 83] min:  ... | MAX: ... | var():  ...
     ...
    391. 0.0004347 |100%| __________xxxxxxxxxxxxx_2 [ 82] min:  ... | MAX: ... | var():  ...
    

    So rather plan & reserve a bit more vCPU-core capacity
    than expect to run such a search on a laptop during a forthcoming lunchtime...
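
    The LDF()/GDF() tooling above is not published, but a similar cumulative-importance listing can be sketched from scikit-learn's standard feature_importances_ attribute. The helper name report_feature_importances(), the ExtraTreesClassifier choice and the X_ext / y names below are illustrative assumptions of this sketch, not the original aFeatureImportancesMAP_v4():

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier

    def report_feature_importances( aFittedLEARNER, X ):
        # a hypothetical stand-in for aFeatureImportancesMAP_v4():
        # prints per-feature importance ( ~ the LDF column ) and the
        # running cumulative share ( ~ the GDF column ), strongest first
        fI     = aFittedLEARNER.feature_importances_
        order  = np.argsort( fI )[::-1]                  # descending by importance
        cumGDF = np.cumsum( fI[order] ) / fI.sum()       # cumulative share of total importance
        for rank, col in enumerate( order ):
            print( "%4d. %.7f |%3.0f%%| feature[%3d] min: %g | MAX: %g | var(): %g"
                   % ( rank, fI[col], 100 * cumGDF[rank], col,
                       X[:, col].min(), X[:, col].max(), X[:, col].var() )
                   )

    # usage sketch:
    # aLEARNER = ExtraTreesClassifier( n_estimators = 500 ).fit( X_ext, y )
    # report_feature_importances( aLEARNER, X_ext )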


    Let's have a working plan

    The intended auto-discovery service is, for many reasons, not a part of scikit-learn; however, the goal is achievable.

    We will use the following adaptation steps to make it work:

    1. we'll rely on scikit-learn's ability to search for the best tandem of [learner + hyperparameters] for a well-defined AI/ML problem

    2. we'll rely on numpy's powers, for obvious reasons, to support the scikit-learn phase

    3. we'll rely on proper scikit-learn AI/ML-engine processing & process controls ( Pipeline, GridSearchCV et al ), which are far better optimised at the low level for such massive-scale attacks, rather than on "externally" orchestrated for-looping ( which loses all the valuable cache/data-locality advantages and is known to carry a remarkable performance disadvantage )

    4. we'll substitute the wished-for auto-discovery with a fast, one-step, a-priori DataSET adaptation

    5. we'll let scikit-learn decide ( quantitatively indicate ) which of the pre-lagged features, artificially synthesised into the compound DataSET elaborated in [4], finally have the best predictive powers ( a minimal Pipeline / GridSearchCV skeleton for these steps is sketched right after this list )
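
    As a minimal skeleton of steps 1., 3. and 5., assuming a recent scikit-learn ( sklearn.model_selection et al ); the estimator choice, the parameter ranges and the X_ext / y names are illustrative assumptions only, not a recommendation:

    from sklearn.pipeline          import Pipeline
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.ensemble          import RandomForestClassifier
    from sklearn.model_selection   import GridSearchCV, TimeSeriesSplit

    # X_ext ~ the pre-lagged extendedDataSET from step [4], y ~ the class labels
    aPIPELINE = Pipeline( [ ( "select", SelectKBest( score_func = f_classif ) ),
                            ( "learn",  RandomForestClassifier() ),
                            ]
                          )

    aPARAM_GRID = { "select__k":           [ 25, 50, 100 ],   # how many pre-lagged features to keep
                    "learn__n_estimators": [ 200, 500 ],
                    "learn__max_depth":    [ 3, 5, None ],
                    }

    aSEARCH = GridSearchCV( aPIPELINE,
                            aPARAM_GRID,
                            cv     = TimeSeriesSplit( n_splits = 5 ),   # keeps the time-ordering intact
                            n_jobs = -1                                 # use all local cores
                            )
    # aSEARCH.fit( X_ext, y )
    # aSEARCH.best_params_ then indicates which "select__k" / learner settings generalised best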


    [4] DataSET adaptation with smart numpy aids:

    Your DataSET consists of an unspecified number of TimeSeries. For each of them, you assume that some pre-lagged version may have better predictive power, and you would like to find it ( i.e. quantitatively support the selection of such lags for the final ML-predictor ).

    So let's first construct, for each TimeSeries DataSET[i,:] in the source part of the DataSET, an extended part of the DataSET which contains the respective pre-lagged versions of that TimeSeries:

    >>> import numpy as np                      # missing import added
    >>>
    >>> def generate_TimeSERIE_preLAGs( aTimeSERIE, pre_lag_window_depth ):
    ...     #
    ...     # COURTESY & THANKS TO:
    ...     #                     Nicolas P. Rougier, INRIA
    ...     #             Author: Joe Kington / Erik Rigtorp
    ...     #
    ...     # builds a sliding-window view: row i holds
    ...     # [ aTimeSERIE[i], ..., aTimeSERIE[i + pre_lag_window_depth - 1] ]
    ...     # NOTE: assumes a contiguous 1-D array ( so itemsize == stride )
    ...     shape   = ( aTimeSERIE.size - pre_lag_window_depth + 1,
    ...                 pre_lag_window_depth
    ...                 )
    ...     strides = ( aTimeSERIE.itemsize,
    ...                 aTimeSERIE.itemsize
    ...                 )
    ...     return np.lib.stride_tricks.as_strided( aTimeSERIE,
    ...                                             shape,
    ...                                             strides = strides
    ...                                             )
    ...
    >>> xVECTOR = np.arange( 10 )
    >>>
    >>> pre_laggz_on_xVECTOR = generate_TimeSERIE_preLAGs( xVECTOR, 4 )
    >>>
    >>> pre_laggz_on_xVECTOR
    array([[0, 1, 2, 3],
           [1, 2, 3, 4],
           [2, 3, 4, 5],
           [3, 4, 5, 6],
           [4, 5, 6, 7],
           [5, 6, 7, 8],
           [6, 7, 8, 9]])
    >>>
    

    With such an extended ( wider, and you know by how much ) but static extendedDataSET, now containing both the original TimeSERIE vector and all the pre-lagged versions thereof that you wish to test, your ML-search starts.
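
    One possible way to assemble such an extendedDataSET from several source TimeSeries, re-using generate_TimeSERIE_preLAGs() from above, is sketched below; the helper name build_extendedDataSET(), the equal-length assumption and the label alignment are assumptions of this sketch, not the only option:

    import numpy as np

    def build_extendedDataSET( aListOfTimeSERIEs, pre_lag_window_depth ):
        # stacks the pre-lagged windows of each TimeSeries side by side,
        # re-using generate_TimeSERIE_preLAGs() defined above;
        # rows ~ time-aligned observations, columns ~ all lags of all series
        # ( assumes every TimeSeries has the same length )
        blocks = [ generate_TimeSERIE_preLAGs( np.ascontiguousarray( aTimeSERIE ),
                                               pre_lag_window_depth
                                               )
                   for aTimeSERIE in aListOfTimeSERIEs
                   ]
        return np.hstack( blocks )    # shape: ( N - depth + 1, depth * number_of_series )

    # usage sketch ( priceSERIE, volumeSERIE, targetLABELs are placeholders ):
    # X_ext = build_extendedDataSET( [ priceSERIE, volumeSERIE ], 4 )
    # y     = targetLABELs[ 4 - 1 : ]   # align each label with the last element of its window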

    Phase [1.A]
    Initially use scikit-learn tools for the best feature-selection supporting your hypothesis
    +
    Phase [1.B]
    next, start hyperparameter optimisation for the best cross-validation results, supporting a maximum generalisation ability of the learner.

    Phase [1.B] naturally ought to run on a sub-set of the extendedDataSET ( which was intentionally extended for the sake of the scikit-learn feature-selection evaluation in phase [1.A] ).
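
    A minimal sketch of the two phases, assuming the X_ext / y names from the assembly sketch above and a recent scikit-learn; the ExtraTreesClassifier, the "median" threshold and the parameter grid are illustrative assumptions only:

    from sklearn.ensemble          import ExtraTreesClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection   import GridSearchCV, TimeSeriesSplit

    # Phase [1.A]: let a tree-ensemble rank all pre-lagged columns of the extendedDataSET
    aRANKER   = ExtraTreesClassifier( n_estimators = 500 ).fit( X_ext, y )
    aSELECTOR = SelectFromModel( aRANKER, prefit = True, threshold = "median" )
    X_reduced = aSELECTOR.transform( X_ext )             # keep the stronger half of the features

    # Phase [1.B]: hyperparameter search, restricted to the reduced sub-set only
    aSEARCH = GridSearchCV( ExtraTreesClassifier(),
                            { "n_estimators": [ 200, 500, 1000 ],
                              "max_features": [ "sqrt", 0.25, 0.50 ],
                              },
                            cv     = TimeSeriesSplit( n_splits = 5 ),
                            n_jobs = -1
                            ).fit( X_reduced, y )

    TimeSeriesSplit is used here only to keep the folds time-ordered and avoid look-ahead leakage; any other leakage-safe splitter would serve the same purpose.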


    Epilogue:
    Memento mori: Quants do not like this ... but

    For the sake of your further interests in TimeSeries analyses & quantitative modelling, you might be interested in the best answer on this >>>


    Correlation does not imply Causation, so even more care is needed when making decisions that will actually be executed ( paper can always handle much more than the markets can :o) ).