Tags: python, keras, recurrent-neural-network

Advisable ways to shape my data as input for an RNN


I have a dataframe X, where each row is a data point in time and each column is a feature. The label/target variable Y is univariate. One of the columns of X is the lagged values of Y.

The RNN input has the shape (batch_size, n_timesteps, n_features).

From what I've been reading on this site, batch_size should be as big as possible without running out of memory. My main doubt is about n_timesteps and n_features.

I think n_features is the number of columns in the X dataframe.

What about the n_timesteps?


Solution

  • Consider the following dataframe with the features temperature, pressure, and humidity:

    import numpy as np
    import pandas as pd
    
    # 20 rows (time steps) of random values for each of the three features
    X = pd.DataFrame(data={
        'temperature': np.random.random(20),
        'pressure': np.random.random(20),
        'humidity': np.random.random(20),
    })
    
    # to_markdown() requires the optional `tabulate` package
    print(X.to_markdown())
    
    |    |   temperature |   pressure |   humidity |
    |---:|--------------:|-----------:|-----------:|
    |  0 |     0.205905  |  0.0824903 | 0.629692   |
    |  1 |     0.280732  |  0.107473  | 0.588672   |
    |  2 |     0.0113955 |  0.746447  | 0.156373   |
    |  3 |     0.205553  |  0.957509  | 0.184099   |
    |  4 |     0.741808  |  0.689842  | 0.0891679  |
    |  5 |     0.408923  |  0.0685223 | 0.317061   |
    |  6 |     0.678908  |  0.064342  | 0.219736   |
    |  7 |     0.600087  |  0.369806  | 0.632653   |
    |  8 |     0.944992  |  0.552085  | 0.31689    |
    |  9 |     0.183584  |  0.102664  | 0.545828   |
    | 10 |     0.391229  |  0.839631  | 0.00644447 |
    | 11 |     0.317618  |  0.288042  | 0.796232   |
    | 12 |     0.789993  |  0.938448  | 0.568106   |
    | 13 |     0.0615843 |  0.704498  | 0.0554465  |
    | 14 |     0.172264  |  0.615129  | 0.633329   |
    | 15 |     0.162544  |  0.439882  | 0.0185174  |
    | 16 |     0.48592   |  0.280436  | 0.550733   |
    | 17 |     0.0370098 |  0.790943  | 0.592646   |
    | 18 |     0.371475  |  0.976977  | 0.460522   |
    | 19 |     0.493215  |  0.381539  | 0.995716   |
    

    Now, if you want to use this kind of data for time series prediction with an RNN model, you usually treat one row of the dataframe as one timestep. Converting the dataframe into an array might also help you understand what the timesteps are:

    print(np.expand_dims(X.to_numpy(), axis=1).shape)
    # (20, 1, 3)
    

    First, X.to_numpy() gives an array of shape (20, 3), in other words 20 samples with three features each. I then explicitly add a time dimension with np.expand_dims, resulting in the shape (20, 1, 3): the data set consists of 20 samples, each sample has one timestep, and each timestep has 3 features. You can now use this data directly as input for an RNN.
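
The array above has n_timesteps = 1, so each sample contains a single row. If you want each sample to cover several consecutive rows instead, one common approach is a sliding window over the rows. The following is only a minimal sketch of that idea, not the only way to do it: the helper name `make_windows`, the window length of 5, and the target array `y` are all hypothetical and chosen for illustration. It also assumes that each window is paired with the target of its last row; for forecasting the next step you would shift the targets by one.

```python
import numpy as np


def make_windows(features, targets, n_timesteps):
    """Stack overlapping windows of consecutive rows into a 3-D array.

    Hypothetical helper for illustration; returns an array of shape
    (n_windows, n_timesteps, n_features) and the matching targets.
    """
    features = np.asarray(features)
    targets = np.asarray(targets)
    n_windows = len(features) - n_timesteps + 1
    if n_windows < 1:
        raise ValueError("n_timesteps exceeds the number of rows")
    # Window i holds rows i .. i + n_timesteps - 1
    windows = np.stack(
        [features[i:i + n_timesteps] for i in range(n_windows)]
    )
    # Assumption: each window is paired with the target of its last row
    window_targets = targets[n_timesteps - 1:]
    return windows, window_targets


rng = np.random.default_rng(0)
data = rng.random((20, 3))  # stands in for X.to_numpy()
y = rng.random(20)          # hypothetical univariate target

X_windows, y_windows = make_windows(data, y, n_timesteps=5)
print(X_windows.shape)  # (16, 5, 3)
print(y_windows.shape)  # (16,)
```

Here 20 rows and a window of 5 timesteps give 20 - 5 + 1 = 16 samples, so the result has the (batch_size, n_timesteps, n_features) layout with n_timesteps = 5 and n_features = 3. How long the window should be depends on how much history you expect to matter for the prediction, so it is usually treated as a hyperparameter.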