Tags: python, statistics, scikit-learn, statsmodels, algorithmic-trading

Linear regression prediction correlates with the input data (test data), not with the test results


I'm trying to predict the following:

  • list( [ close price (current day) - open price (current day) ] )

by using the following as input:

  • list( [ open price (current day) - close price (yesterday) ] )

However, my test_prediction appears to predict the wrong thing.

Predictions from both the sklearn and statsmodels linear regression models show ~100% correlation with the input data (test_data), whereas the prediction results should be correlated with test_result.

What am I doing wrong (or missing here), and how do I fix it? The code below generates 4 plots showing the correlations between the different lists.


###### Working, usable example and code below ######

import numpy as np
from plotly.offline import plot
import plotly.graph_objs as go

from sklearn import linear_model
import statsmodels.api as sm

def xy_corr( x, y, fname ):
    # scatter-plot x against y and save the figure as <fname>.html
    trace1 = go.Scatter( x          = x, 
                         y          = y, 
                         mode       = 'markers',
                         marker     = dict( size  = 6,
                                            color = 'black',
                                            ),
                         showlegend = False
                         )
    layout = go.Layout( title  = fname )

    fig    = go.Figure( data   = [trace1],
                        layout = layout
                        )
    plot( fig, filename = fname + '.html' )

open_p  = [23215, 23659, 23770, 23659, 23659, 23993, 23987, 23935, 24380, 24271, 24314, 24018, 23928, 23240, 24193, 23708, 23525, 23640, 23494, 23333, 23451, 23395, 23395, 23925, 23936, 24036, 24008, 24248, 24249, 24599, 24683, 24708, 24510, 24483, 24570, 24946, 25008, 24880, 24478, 24421, 24630, 24540, 24823, 25090, 24610, 24866, 24578, 24686, 24465, 24225, 24526, 24645, 24780, 24538, 24895, 24921, 24743, 25163, 25163, 25316, 25320, 25158, 25375, 25430, 25466, 25231, 25103, 25138, 25138, 25496, 25502, 25610, 25625, 25810, 25789, 25533, 25785, 25698, 25373, 25558, 25594, 25026, 24630, 24509, 24535, 24205, 24465, 23847, 24165, 23840, 24216, 24355, 24158, 23203, 23285, 23423, 23786, 23729, 23944, 23637]
close_p = [23656, 23758, 23663, 23659, 23989, 23978, 24142, 24152, 24279, 24271, 24393, 23942, 23640, 24102, 23710, 23708, 23705, 23693, 23561, 23441, 23395, 23395, 23990, 23900, 24158, 24188, 24241, 24248, 24699, 24678, 24715, 24523, 24486, 24483, 24947, 24904, 24923, 24478, 24434, 24421, 24409, 24705, 25047, 24642, 24875, 24866, 24698, 24463, 24262, 24396, 24633, 24645, 24528, 24895, 24895, 24839, 25178, 25163, 25315, 25323, 25149, 25387, 25375, 25469, 25231, 25073, 25138, 25138, 25448, 25611, 25705, 25623, 25813, 25798, 25560, 25518, 25743, 25305, 25654, 25579, 25315, 24783, 24508, 24532, 24208, 24176, 24047, 24148, 24165, 24159, 24286, 24249, 23635, 23128, 23438, 23869, 23420, 23756, 23705, 24018]

# input feature : overnight gap  Open[i] - Close[i-1], shaped (N, 1), as sklearn expects a 2-D X
open_prev_close_diff    = np.array( [ open_p[i] - close_p[i-1] for i in range( 1, len( open_p ) )] )[np.newaxis].T
# target        : intra-day body Close[i] - Open[i],   shaped (N,)
open_current_close_diff = np.array( [close_p[i] -  open_p[i]   for i in range( 1, len( open_p ) )] )

train_data = open_prev_close_diff[  :80]
test_data  = open_prev_close_diff[80:]

train_result = open_current_close_diff[  :80]
test_result  = open_current_close_diff[80:]

regressor    = linear_model.LinearRegression()
regressor.fit( train_data, train_result )

test_prediction = np.array( [int(i) for i in regressor.predict( test_data )] )

xy_corr( [int(i) for i in test_result], test_prediction, 'known_result_and_prediction_result_sklearn')
xy_corr( [int(i) for i in test_data],   test_prediction, 'input_data_and_prediction_result_sklearn'  )

# NB: statsmodels' OLS fits without an intercept unless the design matrix
#     is first passed through sm.add_constant()
olsmod = sm.OLS( train_result, train_data )
olsres = olsmod.fit()

test_prediction = np.array( [int(i) for i in olsres.predict( test_data )] )

xy_corr( [int(i) for i in test_result], test_prediction, 'known_result_and_prediction_result_smOLS')
xy_corr( [int(i) for i in test_data],   test_prediction, 'input_data_and_prediction_result_smOLS'  )

Solution

  • Hoping no one will consider this impolite and/or harmful,

    let me cite a lovely point of view on the underlying assumption, from "Correlation does not imply causation", that many contemporary quantitative-finance modellers neglect or abstract away from:

    For any two correlated events, A and B,
    the following relationships are possible:
    A causes B; ( direct causation )
    B causes A; ( reverse causation )
    A and B are consequences of a common cause, but do not cause each other;
    A causes B and B causes A ( bidirectional or cyclic causation );
    A causes C which causes B ( indirect causation );
    There is no connection between A and B; the correlation is a coincidence.

    Thus no conclusion can be drawn about the existence or the direction of a cause-and-effect relationship solely from the fact that A and B are correlated.


    UPDATE:

    There ought to be little or no surprise that a LinearRegression() predictor did not produce anything but a line -- i.e. each of its predictions does indeed lie on the very line of the model. So half of the plots are trivially guaranteed to look this way for this kind of predictor.
    Q.E.D.
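
    A minimal sketch of that point, on synthetic data ( all names below are illustrative, none come from the question's code ): a one-feature LinearRegression() maps every input x onto a * x + b, so the Pearson correlation between inputs and predictions is exactly +/-1 by construction, no matter how weak the underlying relation is.

    import numpy as np
    from sklearn import linear_model

    rng = np.random.default_rng( 0 )
    x   = rng.normal( size = ( 80, 1 ) )
    y   = 0.3 * x[:, 0] + rng.normal( scale = 2.0, size = 80 )  # weak, noisy relation

    model  = linear_model.LinearRegression().fit( x, y )

    x_test = rng.normal( size = ( 20, 1 ) )
    y_hat  = model.predict( x_test )       # y_hat = a * x + b, always on one line

    # predictions correlate ( up to sign ) perfectly with the *input*, by construction
    print( np.corrcoef( x_test[:, 0], y_hat )[0, 1] )           # ~ +1.0 ( or -1.0 )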

    The other half shows nothing more than the level of over-simplification the linear model has imposed in spite of the observable reality.

    Sure, the real behaviour observed is NOT linear. But do not blame the predictor for "failing to fit": its duty was to formulate an MSE-minimising linear model, and it could not find any better linear fit on the training part of the DataSET. If it were trained on a synthetic y = x^2 DataSET ( for which one has a-priori knowledge of the parabolic shape ), it would again produce only a linear model, with the minimum MSE over the training fraction of the DataSET. We can all be sure in advance that ANY such line will yield totally flawed predictions out-of-sample, not because the predictor failed to work well, but because of the principal nonsense of the externally imposed attempt to use a linear-model predictor in a ( knowingly quadratic ) context, where it does not follow the ( known ) reality.
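
    A short sketch of that y = x^2 remark, again on synthetic, illustrative data: a linear model trained on a known parabola still returns the MSE-optimal line, and its residuals expose the limits of the model family rather than any defect in the fitting.

    import numpy as np
    from sklearn import linear_model
    from sklearn.metrics import mean_squared_error

    x = np.linspace( -3, 3, 100 )[:, np.newaxis]
    y = x[:, 0]**2                                      # a-priori known parabola

    lin = linear_model.LinearRegression().fit( x, y )

    print( lin.coef_, lin.intercept_ )                  # slope ~ 0, intercept ~ 3
    print( mean_squared_error( y, lin.predict( x ) ) )  # large MSE even in-sample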


    DataSET:

    As an elementary quantitative view, much simpler than a rigorous Kolmogorov-Smirnov test of the un-expressed hypothesis,
    check the percentage of negative differences in the intra-day gap ( Open[i] - Close[i-1] ), about [ 75% ] in a rather shallow DataSET of just 100 samples,
    against
    the percentage of negative day-candle bodies ( Close[i] - Open[i] ), just [ 55% ] in the same shallow DataSET of 100 samples ( a quick numpy cross-check is sketched right below ).
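
    Assuming the open_p / close_p lists from the question's code are in scope, those two fractions can be cross-checked directly with numpy:

    import numpy as np

    gap  = np.array( [ open_p[i] - close_p[i-1] for i in range( 1, len( open_p ) )] )
    body = np.array( [close_p[i] -  open_p[i]   for i in range( 1, len( open_p ) )] )

    print( 'negative intra-day gaps   : %.0f %%' % ( 100 * np.mean( gap  < 0 ) ) )  # compare with the quoted ~75 %
    print( 'negative day-candle bodies: %.0f %%' % ( 100 * np.mean( body < 0 ) ) )  # compare with the quoted ~55 %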

    Anyway, training on as few as 80 day-samples has a poor chance to generalise well, even with a much better engineered predictor-model, and one ought to focus not only on a better generalisation ability but also on avoiding seasonal biases et al.

    To give some idea of where ML goes in this field: my best performing AI/ML-models have about 0k3 ( ~300 ) features ( many of them highly non-linear synthetic features ) and get intensively trained across 30k+ DataSETs, carefully carving out their risk of overfitting and searching a vast space of the learner-engines' hyperparameter StateSPACE.

    |
    |>>> QuantFX.get_LDF_GDF_fromGivenRANGE( [ open_PRICE[i] - close_PRICE[i-1] for i in range( 1, len( close_PRICE ) ) ], nBINs_ = 31, aPrefixTEXT_ = "" )
         0:  ~ -432.00 LDF =      1 |____ 1.0 % _||____ 1 %
         1:  ~ -408.10 LDF =      1 |____ 1.0 % _||____ 2 %
         2:  ~ -384.19 LDF =      1 |____ 1.0 % _||____ 3 %
         3:  ~ -360.29 LDF =      0 |____ 0.0 % _||____ 3 %
         4:  ~ -336.39 LDF =      1 |____ 1.0 % _||____ 4 %
         5:  ~ -312.48 LDF =      1 |____ 1.0 % _||____ 5 %
         6:  ~ -288.58 LDF =      1 |____ 1.0 % _||____ 6 %
         7:  ~ -264.68 LDF =      0 |____ 0.0 % _||____ 6 %
         8:  ~ -240.77 LDF =      1 |____ 1.0 % _||____ 7 %
         9:  ~ -216.87 LDF =      3 |____ 3.0 % _||___ 10 %
        10:  ~ -192.97 LDF =      2 |____ 2.0 % _||___ 12 %
        11:  ~ -169.06 LDF =      1 |____ 1.0 % _||___ 13 %
        12:  ~ -145.16 LDF =      1 |____ 1.0 % _||___ 14 %
        13:  ~ -121.26 LDF =      2 |____ 2.0 % _||___ 16 %
        14:  ~  -97.35 LDF =      5 |____ 5.1 % _||___ 21 %
        15:  ~  -73.45 LDF =      3 |____ 3.0 % _||___ 24 %
        16:  ~  -49.55 LDF =      5 |____ 5.1 % _||___ 29 %
        17:  ~  -25.65 LDF =     18 |___ 18.2 % _||___ 47 %
        18:  ~   -1.74 LDF =     28 |___ 28.3 % _||___ 75 %
        19:  ~   22.16 LDF =      5 |____ 5.1 % _||___ 80 %
        20:  ~   46.06 LDF =      5 |____ 5.1 % _||___ 85 %
        21:  ~   69.97 LDF =      2 |____ 2.0 % _||___ 87 %
        22:  ~   93.87 LDF =      1 |____ 1.0 % _||___ 88 %
        23:  ~  117.77 LDF =      4 |____ 4.0 % _||___ 92 %
        24:  ~  141.68 LDF =      1 |____ 1.0 % _||___ 93 %
        25:  ~  165.58 LDF =      1 |____ 1.0 % _||___ 94 %
        26:  ~  189.48 LDF =      1 |____ 1.0 % _||___ 95 %
        27:  ~  213.39 LDF =      1 |____ 1.0 % _||___ 96 %
        28:  ~  237.29 LDF =      0 |____ 0.0 % _||___ 96 %
        29:  ~  261.19 LDF =      1 |____ 1.0 % _||___ 97 %
        30:  ~  285.10 LDF =      2 |____ 2.0 % _||__ 100 %
    +0:00:06.234000
    |
    |
    |>>> QuantFX.get_LDF_GDF_fromGivenRANGE( [ close_PRICE[i] - open_PRICE[i] for i in range( 1, len( close_PRICE ) ) ], nBINs_ = 31, aPrefixTEXT_ = "" )
         0:  ~ -523.00 LDF =      2 |____ 2.0 % _||____ 2 %
         1:  ~ -478.32 LDF =      1 |____ 1.0 % _||____ 3 %
         2:  ~ -433.65 LDF =      3 |____ 3.0 % _||____ 6 %
         3:  ~ -388.97 LDF =      1 |____ 1.0 % _||____ 7 %
         4:  ~ -344.29 LDF =      1 |____ 1.0 % _||____ 8 %
         5:  ~ -299.61 LDF =      2 |____ 2.0 % _||___ 10 %
         6:  ~ -254.94 LDF =      7 |____ 7.1 % _||___ 17 %
         7:  ~ -210.26 LDF =      3 |____ 3.0 % _||___ 20 %
         8:  ~ -165.58 LDF =      2 |____ 2.0 % _||___ 22 %
         9:  ~ -120.90 LDF =      5 |____ 5.1 % _||___ 27 %
        10:  ~  -76.23 LDF =      6 |____ 6.1 % _||___ 33 %
        11:  ~  -31.55 LDF =     22 |___ 22.2 % _||___ 55 %
        12:  ~   13.13 LDF =      7 |____ 7.1 % _||___ 62 %
        13:  ~   57.81 LDF =      5 |____ 5.1 % _||___ 67 %
        14:  ~  102.48 LDF =      4 |____ 4.0 % _||___ 71 %
        15:  ~  147.16 LDF =      8 |____ 8.1 % _||___ 79 %
        16:  ~  191.84 LDF =      6 |____ 6.1 % _||___ 85 %
        17:  ~  236.52 LDF =      2 |____ 2.0 % _||___ 87 %
        18:  ~  281.19 LDF =      3 |____ 3.0 % _||___ 90 %
        19:  ~  325.87 LDF =      2 |____ 2.0 % _||___ 92 %
        20:  ~  370.55 LDF =      2 |____ 2.0 % _||___ 94 %
        21:  ~  415.23 LDF =      3 |____ 3.0 % _||___ 97 %
        22:  ~  459.90 LDF =      0 |____ 0.0 % _||___ 97 %
        23:  ~  504.58 LDF =      0 |____ 0.0 % _||___ 97 %
        24:  ~  549.26 LDF =      0 |____ 0.0 % _||___ 97 %
        25:  ~  593.94 LDF =      1 |____ 1.0 % _||___ 98 %
        26:  ~  638.61 LDF =      0 |____ 0.0 % _||___ 98 %
        27:  ~  683.29 LDF =      0 |____ 0.0 % _||___ 98 %
        28:  ~  727.97 LDF =      0 |____ 0.0 % _||___ 98 %
        29:  ~  772.65 LDF =      0 |____ 0.0 % _||___ 98 %
        30:  ~  817.32 LDF =      1 |____ 1.0 % _||__ 100 %
    +0:01:13.172000
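
    QuantFX.get_LDF_GDF_fromGivenRANGE() above is the answer author's own tooling and is not publicly available; a plain-numpy approximation of the same view ( per-bin counts, the LDF, plus a cumulative percentage, the GDF ) might look like this sketch:

    import numpy as np

    def ldf_gdf( values, n_bins = 31 ):
        # per-bin counts ( LDF ) and cumulative percentage ( GDF ) over n_bins bins
        counts, edges = np.histogram( values, bins = n_bins )
        cum_pct = 100.0 * np.cumsum( counts ) / len( values )
        for i, ( lo, c, g ) in enumerate( zip( edges[:-1], counts, cum_pct ) ):
            print( '%3d: ~ %8.2f LDF = %4d | %5.1f %% || %4.0f %%'
                   % ( i, lo, c, 100.0 * c / len( values ), g ) )

    # e.g. on the intra-day gap series built from the question's open_p / close_p:
    ldf_gdf( [ open_p[i] - close_p[i-1] for i in range( 1, len( open_p ) )] )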