python, pandas, machine-learning, keras, finance

Machine learning: how to use the past 20 rows as an input for X for each Y value


I have a very simple machine learning code here:

import pandas

# load dataset
dataframe = pandas.read_csv("USDJPY,5.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:59]   # first 59 columns are the features
Y = dataset[:,59]     # 60th column is the 0/1 label

# fit Dense Keras model (model is built and compiled earlier)
model.fit(X, Y, validation_data=(X_test, Y_test), epochs=150, batch_size=10)

My X values are 59 features with the 60th column being my Y value, a simple 1 or 0 classification label.

Since I am working with financial data, I would like to look back at the past 20 rows of X in order to predict each Y value.

So how could I make my algorithm use the past 20 rows as an input for X for each Y value?

I'm relatively new to machine learning and have spent a lot of time looking online for a solution to my problem, yet I could not find anything as simple as my case.

Any ideas?


Solution

  • This is typically done with Recurrent Neural Networks (RNNs), which retain some memory of the previous input when the next input is received. That's a very brief explanation of what goes on, but there are plenty of sources on the internet to help you better understand how they work.

    Let's break this down with a simple example. Say you have 5 samples and 5 features of data, and you want to stagger the data by 2 rows instead of 20. Here is your data (assuming 1 stock, with the oldest price value first); we can think of each row as a day of the week:

    import numpy as np

    ar = np.random.randint(10, 100, (5, 5))
    
    [[43, 79, 67, 20, 13],    #<---Monday---
     [80, 86, 78, 76, 71],    #<---Tuesday---
     [35, 23, 62, 31, 59],    #<---Wednesday---
     [67, 53, 92, 80, 15],    #<---Thursday---
     [60, 20, 10, 45, 47]]    #<---Friday---
    

    To use an LSTM in Keras, your data needs to be 3-D, versus the 2-D structure it has now, and the notation for each dimension is (samples, timesteps, features). Currently you only have (samples, features), so you need to augment the data.

    a2 = np.concatenate([ar[x:x+2,:] for x in range(ar.shape[0]-1)])
    a2 = a2.reshape(4,2,5)
    
    [[[43, 79, 67, 20, 13],    #See Monday First
      [80, 86, 78, 76, 71]],   #See Tuesday second ---> Predict Value originally set for Tuesday
     [[80, 86, 78, 76, 71],    #See Tuesday First
      [35, 23, 62, 31, 59]],   #See Wednesday Second ---> Predict Value originally set for Wednesday
     [[35, 23, 62, 31, 59],    #See Wednesday Value First
      [67, 53, 92, 80, 15]],   #See Thursday Values Second ---> Predict value originally set for Thursday
     [[67, 53, 92, 80, 15],    #And so on
      [60, 20, 10, 45, 47]]]
    

    Notice how the data is staggered and 3-dimensional. Now just make an LSTM network. Y does not need to become 3-D since this is a many-to-one structure, but you do need to clip its first value, because the first window already ends on the second row (see the sketch after the model code below).

    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    model = Sequential()
    model.add(LSTM(hidden_dims, input_shape=(a2.shape[1], a2.shape[2])))  # hidden_dims = number of LSTM units you choose
    model.add(Dense(1))  # single output for the 0/1 label
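
    To illustrate the clipping step, here is a minimal sketch, assuming a toy 1-D label array y with one 0/1 value per original row; the label values, epochs and batch size are made up for illustration:

    import numpy as np

    # toy labels, one per original row (Monday..Friday)
    y = np.array([0, 1, 0, 1, 1])

    # each 2-row window predicts the label of its *last* row, so the first
    # row's label (Monday) has no window of its own and is clipped off
    y2 = y[1:]                     # shape (4,) -- lines up with a2.shape[0]

    model.compile(loss='binary_crossentropy', optimizer='adam')
    model.fit(a2, y2, epochs=10, batch_size=2)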
    

    This is just a brief example to get you moving. There are many different setups that will work (including ones that don't use an RNN); you need to find the correct one for your data.
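
    For reference, the same staggering idea applied to the setup in the question (a 20-row lookback over 59 features) might look roughly like the sketch below; lookback, X_seq and Y_seq are just illustrative names, and the list comprehension is one of several equivalent ways to build the 3-D array:

    import numpy as np
    import pandas

    lookback = 20

    dataframe = pandas.read_csv("USDJPY,5.csv", header=None)
    dataset = dataframe.values
    X = dataset[:, 0:59]           # (n_rows, 59) features
    Y = dataset[:, 59]             # (n_rows,) 0/1 labels

    # one (lookback, 59) window ending at every row that has 20 rows of history
    X_seq = np.array([X[i - lookback + 1:i + 1] for i in range(lookback - 1, len(X))])
    Y_seq = Y[lookback - 1:]       # label of the last row in each window

    print(X_seq.shape)             # (n_rows - 19, 20, 59)
    print(Y_seq.shape)             # (n_rows - 19,)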