python, pandas, machine-learning, keras, finance

Machine learning: how to use the past 20 rows as an input for X for each Y value


I have a very simple machine learning code here:

import pandas

# load dataset
dataframe = pandas.read_csv("USDJPY,5.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:59]   # first 59 columns are the features
Y = dataset[:,59]     # 60th column is the 0/1 label

# fit Dense Keras model (model is built and compiled earlier)
model.fit(X, Y, validation_data=(X_test, Y_test), epochs=150, batch_size=10)

My X values are 59 features with the 60th column being my Y value, a simple 1 or 0 classification label.

Since I am working with financial data, I would like to look back at the past 20 rows of X in order to predict each Y value.

So how could I make my algorithm use the past 20 rows as an input for X for each Y value?

I'm relatively new to machine learning and have spent a lot of time looking online for a solution to my problem, yet I could not find anything as simple as my case.

Any ideas?


Solution

  • This is typically done with Recurrent Neural Networks (RNNs), which retain some memory of the previous input when the next input is received. That's a very brief explanation of what goes on, but there are plenty of sources on the internet to help you better understand how they work.

    Let's break this down with a simple example. Say you have 5 samples and 5 features of data, and you want to stagger the data by 2 rows instead of 20. Here is your data (assuming 1 stock, with the oldest price value first); we can think of each row as a day of the week:

    import numpy as np

    ar = np.random.randint(10, 100, (5, 5))
    
    [[43, 79, 67, 20, 13],    #<---Monday---
     [80, 86, 78, 76, 71],    #<---Tuesday---
     [35, 23, 62, 31, 59],    #<---Wednesday---
     [67, 53, 92, 80, 15],    #<---Thursday---
     [60, 20, 10, 45, 47]]    #<---Friday---
    

    To use an LSTM in Keras, your data needs to be 3-D, versus the 2-D structure it has now, and the notation for each dimension is (samples, timesteps, features). Currently you only have (samples, features), so you need to augment the data.

    a2 = np.concatenate([ar[x:x+2,:] for x in range(ar.shape[0]-1)])
    a2 = a2.reshape(4,2,5)
    
    [[[43, 79, 67, 20, 13],    #See Monday First
      [80, 86, 78, 76, 71]],   #See Tuesday second ---> Predict Value originally set for Tuesday
     [[80, 86, 78, 76, 71],    #See Tuesday First
      [35, 23, 62, 31, 59]],   #See Wednesday Second ---> Predict Value originally set for Wednesday
     [[35, 23, 62, 31, 59],    #See Wednesday Value First
      [67, 53, 92, 80, 15]],   #See Thursday Values Second ---> Predict value originally set for Thursday
     [[67, 53, 92, 80, 15],    #And so on
      [60, 20, 10, 45, 47]]]
    

    Notice how the data is staggered and 3-dimensional. Now just make an LSTM network. Y does not need to become 3-D since this is a many-to-one structure, but you do need to clip its first value, because the first window already ends on the second row (see the sketch after the model code below).

    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    model = Sequential()
    model.add(LSTM(hidden_dims, input_shape=(a2.shape[1], a2.shape[2])))  # hidden_dims = number of LSTM units you choose
    model.add(Dense(1))  # single output for the 0/1 label
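
    To illustrate the clipping step, here is a minimal sketch, assuming a toy 1-D label array y with one 0/1 value per original row; the label values, epochs and batch size are made up for illustration:

    import numpy as np

    # toy labels, one per original row (Monday..Friday)
    y = np.array([0, 1, 0, 1, 1])

    # each 2-row window predicts the label of its *last* row, so the first
    # row's label (Monday) has no window of its own and is clipped off
    y2 = y[1:]                     # shape (4,) -- lines up with a2.shape[0]

    model.compile(loss='binary_crossentropy', optimizer='adam')
    model.fit(a2, y2, epochs=10, batch_size=2)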
    

    This is just a brief example to get you moving. There are many different setups that will work (including ones that don't use an RNN); you need to find the correct one for your data.
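
    For reference, the same staggering idea applied to the setup in the question (a 20-row lookback over 59 features) might look roughly like the sketch below; lookback, X_seq and Y_seq are just illustrative names, and the list comprehension is one of several equivalent ways to build the 3-D array:

    import numpy as np
    import pandas

    lookback = 20

    dataframe = pandas.read_csv("USDJPY,5.csv", header=None)
    dataset = dataframe.values
    X = dataset[:, 0:59]           # (n_rows, 59) features
    Y = dataset[:, 59]             # (n_rows,) 0/1 labels

    # one (lookback, 59) window ending at every row that has 20 rows of history
    X_seq = np.array([X[i - lookback + 1:i + 1] for i in range(lookback - 1, len(X))])
    Y_seq = Y[lookback - 1:]       # label of the last row in each window

    print(X_seq.shape)             # (n_rows - 19, 20, 59)
    print(Y_seq.shape)             # (n_rows - 19,)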