Tags: python, pandas, dataframe, numpy, recurrent-neural-network

Correct way to divide a DataFrame (or NumPy array) by rows


I'm new to machine learning and I'm studying RNNs to classify time series. I'm working with this dataset: https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State# It consists of 14 time series with 14980 steps each. What I would like to get is a set of time series with exactly 20 timesteps each, i.e. a NumPy array of shape (749, 20, 14), where 749 is the number of time series, 20 is the number of timesteps per series and 14 is the number of values per timestep. This array will then be given as input to the network for training. What is the right way to achieve this?

This is the starting DataFrame; the last column contains the integer labels used to classify the time series:

from scipy.io import arff
import pandas as pd

#how to divide it right?
data = arff.loadarff('./datasets/eeg_eye_state.arff')

df = pd.DataFrame(data[0])
# nominal ARFF attributes load as bytes, so decode before casting to int
df['eyeDetection'] = df['eyeDetection'].str.decode('utf-8').astype(int)

Solution

  • Since you're using the EEG Eye State data set, whose description states:

    All values are in chronological order with the first measured value at the top of the data.

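    You could use the TimeseriesGenerator class from tensorflow.keras.preprocessing.sequence to generate batches of temporal data. Note that it slides the window forward one step at a time, so consecutive windows overlap, and the batch of 749 windows shown below is only the first of several batches. In recent TensorFlow releases the class is marked as deprecated in favour of tf.keras.utils.timeseries_dataset_from_array.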

    from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
    
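    n_input = 20      # timesteps per window
    batch_size = 749  # windows per batch
    # convert to plain NumPy arrays so the generator indexes by position
    data_input = df.drop(columns=['eyeDetection']).to_numpy()
    targets = df['eyeDetection'].to_numpy()
    
    data_gen = TimeseriesGenerator(data_input, targets, length=n_input, batch_size=batch_size)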
    
    batch_0 = data_gen[0]
    x, y = batch_0
    
    print(x.shape)
    print(y.shape)
    
    # the generator can also be passed directly to model.fit()
    # model.fit(data_gen, ...)
    
    Output:
    
    (749, 20, 14)
    (749,)
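
If what you need is exactly 749 consecutive, non-overlapping windows covering the whole recording (14980 / 20 = 749), a plain NumPy reshape may be enough. The following is a minimal sketch, not part of the original answer: the helper name split_into_windows is hypothetical, the random arrays only stand in for the real data, and labelling each window with the label of its last timestep is an assumption you may want to replace (for example with a majority vote).

```python
import numpy as np

def split_into_windows(values, labels, window=20):
    """Cut a (n_steps, n_features) array into consecutive, non-overlapping windows."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    n_windows = len(values) // window   # any incomplete trailing window is dropped
    usable = n_windows * window
    # rows are in chronological order, so a row-major reshape keeps each window contiguous
    x = values[:usable].reshape(n_windows, window, values.shape[1])
    # ASSUMPTION: each window takes the label of its last timestep
    y = labels[:usable].reshape(n_windows, window)[:, -1]
    return x, y

# Synthetic stand-in with the same dimensions as the EEG Eye State data
rng = np.random.default_rng(0)
values = rng.normal(size=(14980, 14))
labels = rng.integers(0, 2, size=14980)

x, y = split_into_windows(values, labels, window=20)
print(x.shape)  # (749, 20, 14)
print(y.shape)  # (749,)
```

With the real data you would pass df.drop(columns=['eyeDetection']).to_numpy() as values and df['eyeDetection'].to_numpy() as labels.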