I have a dataframe X, where each row is a data point in time and each column is a feature. The label/target variable Y is univariate. One of the columns of X is the lagged values of Y.
The RNN input has the shape (batch_size, n_timesteps, n_features).
From what I've been reading on this site, batch_size should be as big as possible without running out of memory. My main doubt is about n_timesteps and n_features.
I think n_features is the number of columns in the X dataframe.
What about n_timesteps?
Consider the following dataframe with the features temperature, pressure, and humidity:
import pandas as pd
import numpy as np
X = pd.DataFrame(data={
'temperature': np.random.random((1, 20)).ravel(),
'pressure': np.random.random((1, 20)).ravel(),
'humidity': np.random.random((1, 20)).ravel(),
})
print(X.to_markdown())
| | temperature | pressure | humidity |
|---:|--------------:|-----------:|-----------:|
| 0 | 0.205905 | 0.0824903 | 0.629692 |
| 1 | 0.280732 | 0.107473 | 0.588672 |
| 2 | 0.0113955 | 0.746447 | 0.156373 |
| 3 | 0.205553 | 0.957509 | 0.184099 |
| 4 | 0.741808 | 0.689842 | 0.0891679 |
| 5 | 0.408923 | 0.0685223 | 0.317061 |
| 6 | 0.678908 | 0.064342 | 0.219736 |
| 7 | 0.600087 | 0.369806 | 0.632653 |
| 8 | 0.944992 | 0.552085 | 0.31689 |
| 9 | 0.183584 | 0.102664 | 0.545828 |
| 10 | 0.391229 | 0.839631 | 0.00644447 |
| 11 | 0.317618 | 0.288042 | 0.796232 |
| 12 | 0.789993 | 0.938448 | 0.568106 |
| 13 | 0.0615843 | 0.704498 | 0.0554465 |
| 14 | 0.172264 | 0.615129 | 0.633329 |
| 15 | 0.162544 | 0.439882 | 0.0185174 |
| 16 | 0.48592 | 0.280436 | 0.550733 |
| 17 | 0.0370098 | 0.790943 | 0.592646 |
| 18 | 0.371475 | 0.976977 | 0.460522 |
| 19 | 0.493215 | 0.381539 | 0.995716 |
Now, if you want to use this kind of data for time series prediction with an RNN
model, you usually consider one row in the dataframe as one timestep. Converting the dataframe
into an array might also help you understand what the timesteps are:
print(np.expand_dims(X.to_numpy(), axis=1).shape)
# (20, 1, 3)
First, X.to_numpy() gives an array of shape (20, 3), or in other words, 20 samples with three features each. I then explicitly add a time dimension with np.expand_dims, resulting in the shape (20, 1, 3). This means the data set consists of 20 samples, each sample has one timestep, and each timestep has 3 features. Now, you can use this data directly as input for an RNN.
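With one timestep per sample, each sample carries only a single row of history. If you want each sample to cover several consecutive rows, a common approach is to stack overlapping windows of rows, so that n_timesteps becomes the window length. The following is only a minimal sketch: the window length of 5 and the helper name make_windows are assumptions chosen for illustration, and a random array stands in for X.to_numpy().

```python
import numpy as np


def make_windows(features, n_timesteps):
    """Stack overlapping windows of consecutive rows into a 3D array.

    Hypothetical helper for illustration: turns (n_rows, n_features)
    into (n_windows, n_timesteps, n_features).
    """
    n_windows = features.shape[0] - n_timesteps + 1
    return np.stack([features[i:i + n_timesteps] for i in range(n_windows)])


rng = np.random.default_rng(0)
features = rng.random((20, 3))  # stand-in for X.to_numpy(): 20 rows, 3 features

windows = make_windows(features, n_timesteps=5)  # window length is an assumption
print(windows.shape)
# (16, 5, 3)
```

Here 20 rows with a window of 5 give 20 - 5 + 1 = 16 samples, each with 5 timesteps and 3 features. The target for each window would typically be the Y value that follows the window's last row, which is left out of the sketch.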