I'm trying to predict the following:
list( [ close price (current day) - open price (current day) ] )
by using the following as input:
list( [ open price (current day) - close price (yesterday) ] )
However, my test_prediction result is a prediction of the wrong thing. Predictions from both the sklearn and statsmodels linear regression models show ~100% correlation between the input data ( test_data ) and the prediction results, whereas the prediction results should be correlated with test_result.
What am I doing wrong (or missing) here, and how do I fix it? The code below will generate 4 plots showing the correlations between the different lists.
###### Working usable example and code below ######
import numpy as np
from plotly.offline import plot
import plotly.graph_objs as go
from sklearn import linear_model
import statsmodels.api as sm
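# scatter-plot x against y and save the figure as an offline plotly HTML file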
def xy_corr( x, y, fname ):
trace1 = go.Scatter( x = x,
y = y,
mode = 'markers',
marker = dict( size = 6,
color = 'black',
),
showlegend = False
)
layout = go.Layout( title = fname )
fig = go.Figure( data = [trace1],
layout = layout
)
plot( fig, filename = fname + '.html' )
open_p = [23215, 23659, 23770, 23659, 23659, 23993, 23987, 23935, 24380, 24271, 24314, 24018, 23928, 23240, 24193, 23708, 23525, 23640, 23494, 23333, 23451, 23395, 23395, 23925, 23936, 24036, 24008, 24248, 24249, 24599, 24683, 24708, 24510, 24483, 24570, 24946, 25008, 24880, 24478, 24421, 24630, 24540, 24823, 25090, 24610, 24866, 24578, 24686, 24465, 24225, 24526, 24645, 24780, 24538, 24895, 24921, 24743, 25163, 25163, 25316, 25320, 25158, 25375, 25430, 25466, 25231, 25103, 25138, 25138, 25496, 25502, 25610, 25625, 25810, 25789, 25533, 25785, 25698, 25373, 25558, 25594, 25026, 24630, 24509, 24535, 24205, 24465, 23847, 24165, 23840, 24216, 24355, 24158, 23203, 23285, 23423, 23786, 23729, 23944, 23637]
close_p = [23656, 23758, 23663, 23659, 23989, 23978, 24142, 24152, 24279, 24271, 24393, 23942, 23640, 24102, 23710, 23708, 23705, 23693, 23561, 23441, 23395, 23395, 23990, 23900, 24158, 24188, 24241, 24248, 24699, 24678, 24715, 24523, 24486, 24483, 24947, 24904, 24923, 24478, 24434, 24421, 24409, 24705, 25047, 24642, 24875, 24866, 24698, 24463, 24262, 24396, 24633, 24645, 24528, 24895, 24895, 24839, 25178, 25163, 25315, 25323, 25149, 25387, 25375, 25469, 25231, 25073, 25138, 25138, 25448, 25611, 25705, 25623, 25813, 25798, 25560, 25518, 25743, 25305, 25654, 25579, 25315, 24783, 24508, 24532, 24208, 24176, 24047, 24148, 24165, 24159, 24286, 24249, 23635, 23128, 23438, 23869, 23420, 23756, 23705, 24018]
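# input feature : open-to-previous-close gap, open[i] - close[i-1], shaped as an ( n, 1 ) column vector
# target        : intraday candle body, close[i] - open[i]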
open_prev_close_diff = np.array( [ open_p[i] - close_p[i-1] for i in range( 1, len( open_p ) )] )[np.newaxis].T
open_current_close_diff = np.array( [close_p[i] - open_p[i] for i in range( 1, len( open_p ) )] )
train_data = open_prev_close_diff[ :80]
test_data = open_prev_close_diff[80:]
train_result = open_current_close_diff[ :80]
test_result = open_current_close_diff[80:]
regressor = linear_model.LinearRegression()
regressor.fit( train_data, train_result )
test_prediction = np.array( [int(i) for i in regressor.predict( test_data )] )
xy_corr( [int(i) for i in test_result], test_prediction, 'known_result_and_prediction_result_sklearn')
xy_corr( [int(i) for i in test_data], test_prediction, 'input_data_and_prediction_result_sklearn' )
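# repeat with statsmodels OLS ( note: sm.OLS adds no intercept unless sm.add_constant() is used )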
olsmod = sm.OLS( train_result, train_data )
olsres = olsmod.fit()
test_prediction = np.array( [int(i) for i in olsres.predict( test_data )] )
xy_corr( [int(i) for i in test_result], test_prediction, 'known_result_and_prediction_result_smOLS')
xy_corr( [int(i) for i in test_data], test_prediction, 'input_data_and_prediction_result_smOLS' )
With the hope that no one will consider this impolite and/or harmful, let me cite a lovely point of view on the underlying assumption, from "Correlation does not imply Causation", that many contemporary Quantitative Finance modellers neglect or abstract away from:
For any two correlated events, A
and B
,
the following relationships are possible:
A
causes B
; ( direct causation )
B
causes A
; ( reverse causation )
A
and B
are consequences of a common cause, but do not cause each other;
A
causes B
and B
causes A
( bidirectional or cyclic causation );
A
causes C
which causes B
( indirect causation );
There is no connection between A
and B
; the correlation is a coincidence.
Thus there can be no conclusion made regarding the existence or the direction of a cause-and-effect relationship only from the fact that A
and B
are correlated.
There ought to be little to no surprise that a LinearRegression() predictor did not produce anything but a line, i.e. that each of its predictions lies exactly on the fitted line of the model. With a single input feature, every prediction is just coef_ * x + intercept_, an affine transform of the input, so the predictions are by construction ~100% correlated with the input data. Half of the plots are therefore exactly what this kind of predictor must produce.
Q.E.D.
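A quick numeric check of this point ( a minimal sketch, re-using regressor, test_data and test_result exactly as defined in the question ):

a, b = regressor.coef_[0], regressor.intercept_
manual_prediction = a * test_data.ravel() + b                                    # the model is just a line

print( np.allclose( manual_prediction, regressor.predict( test_data ) ) )       # True
print( np.corrcoef( test_data.ravel(), regressor.predict( test_data ) )[0,1] )  # ~ +/- 1.0, by construction
print( np.corrcoef( test_data.ravel(), test_result )[0,1] )                     # the weak, real correlation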
The other half of the plots shows nothing but the level of over-simplification the linear model has imposed on the observable reality.
Sure, the real behaviour experienced is NOT linear, but do not blame the predictor for "failing to fit": its duty was to formulate an MSE-minimising linear model, and it could not find any better linear fit on the training part of the DataSET. If it were trained on a synthetic y = x^2 DataSET ( for which one has a-priori knowledge of the parabolic shape ), it would again produce only a linear model, with the minimum MSE over the training fraction of the DataSET, and we can all be pretty sure in advance that ANY such line will yield totally flawed predictions OoS -- not due to the predictor's failure to work well, but due to the principal nonsense of the externally indoctrinated attempt to use a linear-model predictor in a ( knowingly quadratic ) context, where it cannot follow the ( known ) reality.
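To illustrate that last point with a self-contained sketch ( synthetic data, not part of the question's code ):

import numpy as np
from sklearn import linear_model

x     = np.linspace( -3, 3, 200 )[:, np.newaxis]        # training range
y     = ( x**2 ).ravel()                                 # knowingly quadratic target

lin   = linear_model.LinearRegression().fit( x, y )      # the best possible *line*, in the MSE sense

x_oos = np.linspace( 4, 6, 50 )[:, np.newaxis]           # out-of-sample range
print( lin.score( x,     y ) )                           # R^2 ~ 0 : a line cannot explain a symmetric parabola
print( lin.score( x_oos, ( x_oos**2 ).ravel() ) )        # strongly negative R^2 out-of-sample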
As an elementary quantitative view, much simpler than a rigorous Kolmogorov-Smirnov test of the un-expressed hypothesis, check the percentage fraction of negative differences in the open-to-previous-close gap ( Open[i] - Close[i-1] ), about [ 75% ] from a rather shallow DataSET of just 100 samples, against the fraction of negative differences in the day candle body ( Close[i] - Open[i] ), just [ 55% ] from the same rather shallow DataSET of 100 samples ( a quick check follows, and the full LDF / GDF printouts are appended at the end ).
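A minimal plain-numpy check of these two fractions ( re-using the open_p / close_p lists from the question; the exact percentages may differ by a point or two from the binned GDF readouts below ):

import numpy as np

gap  = np.array( [ open_p[i]  - close_p[i-1] for i in range( 1, len( open_p ) ) ] )   # open-to-previous-close gap
body = np.array( [ close_p[i] - open_p[i]    for i in range( 1, len( open_p ) ) ] )   # day candle body

print( 'negative gap  fraction : %.1f %%' % ( 100.0 * np.mean( gap  < 0 ) ) )         # ~ 75 %
print( 'negative body fraction : %.1f %%' % ( 100.0 * np.mean( body < 0 ) ) )         # ~ 55 %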
Anyway, training on as few as 80 day-samples has a poor chance to generalise well even with a much better engineered predictor-model, and the effort ought to focus not only on a better generalisation ability, but also on avoiding seasonal biases et al.
To give some idea of where ML goes in this field, my best performing AI/ML-models have about 0k3 ( ~300 ) features ( many of 'em highly non-linear synthetic features ) and get intensively trained across 30k+ DataSETs, carefully carving out their risk of overfitting and searching the vast space of the learner-engines' hyperparameter StateSPACE.
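For illustration only ( a generic sklearn sketch with assumed placeholder data, not the author's proprietary setup ), such a cross-validated hyperparameter search might look like:

import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.ensemble import GradientBoostingRegressor

X = np.random.rand( 500, 10 )                                        # placeholder feature matrix
y = np.random.rand( 500 )                                            # placeholder target vector

search = GridSearchCV( GradientBoostingRegressor(),
                       param_grid = { 'max_depth'    : [ 2, 3, 4 ],
                                      'n_estimators' : [ 50, 100, 200 ] },
                       cv         = TimeSeriesSplit( n_splits = 5 ), # respects temporal ordering
                       scoring    = 'neg_mean_squared_error'
                       )
search.fit( X, y )
print( search.best_params_, search.best_score_ )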
|
|>>> QuantFX.get_LDF_GDF_fromGivenRANGE( [ open_PRICE[i] - close_PRICE[i-1] for i in range( 1, len( close_PRICE ) ) ], nBINs_ = 31, aPrefixTEXT_ = "" )
0: ~ -432.00 LDF = 1 |____ 1.0 % _||____ 1 %
1: ~ -408.10 LDF = 1 |____ 1.0 % _||____ 2 %
2: ~ -384.19 LDF = 1 |____ 1.0 % _||____ 3 %
3: ~ -360.29 LDF = 0 |____ 0.0 % _||____ 3 %
4: ~ -336.39 LDF = 1 |____ 1.0 % _||____ 4 %
5: ~ -312.48 LDF = 1 |____ 1.0 % _||____ 5 %
6: ~ -288.58 LDF = 1 |____ 1.0 % _||____ 6 %
7: ~ -264.68 LDF = 0 |____ 0.0 % _||____ 6 %
8: ~ -240.77 LDF = 1 |____ 1.0 % _||____ 7 %
9: ~ -216.87 LDF = 3 |____ 3.0 % _||___ 10 %
10: ~ -192.97 LDF = 2 |____ 2.0 % _||___ 12 %
11: ~ -169.06 LDF = 1 |____ 1.0 % _||___ 13 %
12: ~ -145.16 LDF = 1 |____ 1.0 % _||___ 14 %
13: ~ -121.26 LDF = 2 |____ 2.0 % _||___ 16 %
14: ~ -97.35 LDF = 5 |____ 5.1 % _||___ 21 %
15: ~ -73.45 LDF = 3 |____ 3.0 % _||___ 24 %
16: ~ -49.55 LDF = 5 |____ 5.1 % _||___ 29 %
17: ~ -25.65 LDF = 18 |___ 18.2 % _||___ 47 %
18: ~ -1.74 LDF = 28 |___ 28.3 % _||___ 75 %
19: ~ 22.16 LDF = 5 |____ 5.1 % _||___ 80 %
20: ~ 46.06 LDF = 5 |____ 5.1 % _||___ 85 %
21: ~ 69.97 LDF = 2 |____ 2.0 % _||___ 87 %
22: ~ 93.87 LDF = 1 |____ 1.0 % _||___ 88 %
23: ~ 117.77 LDF = 4 |____ 4.0 % _||___ 92 %
24: ~ 141.68 LDF = 1 |____ 1.0 % _||___ 93 %
25: ~ 165.58 LDF = 1 |____ 1.0 % _||___ 94 %
26: ~ 189.48 LDF = 1 |____ 1.0 % _||___ 95 %
27: ~ 213.39 LDF = 1 |____ 1.0 % _||___ 96 %
28: ~ 237.29 LDF = 0 |____ 0.0 % _||___ 96 %
29: ~ 261.19 LDF = 1 |____ 1.0 % _||___ 97 %
30: ~ 285.10 LDF = 2 |____ 2.0 % _||__ 100 %
+0:00:06.234000
|
|
|>>> QuantFX.get_LDF_GDF_fromGivenRANGE( [ close_PRICE[i] - open_PRICE[i] for i in range( 1, len( close_PRICE ) ) ], nBINs_ = 31, aPrefixTEXT_ = "" )
0: ~ -523.00 LDF = 2 |____ 2.0 % _||____ 2 %
1: ~ -478.32 LDF = 1 |____ 1.0 % _||____ 3 %
2: ~ -433.65 LDF = 3 |____ 3.0 % _||____ 6 %
3: ~ -388.97 LDF = 1 |____ 1.0 % _||____ 7 %
4: ~ -344.29 LDF = 1 |____ 1.0 % _||____ 8 %
5: ~ -299.61 LDF = 2 |____ 2.0 % _||___ 10 %
6: ~ -254.94 LDF = 7 |____ 7.1 % _||___ 17 %
7: ~ -210.26 LDF = 3 |____ 3.0 % _||___ 20 %
8: ~ -165.58 LDF = 2 |____ 2.0 % _||___ 22 %
9: ~ -120.90 LDF = 5 |____ 5.1 % _||___ 27 %
10: ~ -76.23 LDF = 6 |____ 6.1 % _||___ 33 %
11: ~ -31.55 LDF = 22 |___ 22.2 % _||___ 55 %
12: ~ 13.13 LDF = 7 |____ 7.1 % _||___ 62 %
13: ~ 57.81 LDF = 5 |____ 5.1 % _||___ 67 %
14: ~ 102.48 LDF = 4 |____ 4.0 % _||___ 71 %
15: ~ 147.16 LDF = 8 |____ 8.1 % _||___ 79 %
16: ~ 191.84 LDF = 6 |____ 6.1 % _||___ 85 %
17: ~ 236.52 LDF = 2 |____ 2.0 % _||___ 87 %
18: ~ 281.19 LDF = 3 |____ 3.0 % _||___ 90 %
19: ~ 325.87 LDF = 2 |____ 2.0 % _||___ 92 %
20: ~ 370.55 LDF = 2 |____ 2.0 % _||___ 94 %
21: ~ 415.23 LDF = 3 |____ 3.0 % _||___ 97 %
22: ~ 459.90 LDF = 0 |____ 0.0 % _||___ 97 %
23: ~ 504.58 LDF = 0 |____ 0.0 % _||___ 97 %
24: ~ 549.26 LDF = 0 |____ 0.0 % _||___ 97 %
25: ~ 593.94 LDF = 1 |____ 1.0 % _||___ 98 %
26: ~ 638.61 LDF = 0 |____ 0.0 % _||___ 98 %
27: ~ 683.29 LDF = 0 |____ 0.0 % _||___ 98 %
28: ~ 727.97 LDF = 0 |____ 0.0 % _||___ 98 %
29: ~ 772.65 LDF = 0 |____ 0.0 % _||___ 98 %
30: ~ 817.32 LDF = 1 |____ 1.0 % _||__ 100 %
+0:01:13.172000