Tags: python, tensorflow, keras, lstm

Merge or append multiple Keras TimeseriesGenerator objects into one


I'm trying to build an LSTM model. The data comes from a CSV file that contains values for multiple stocks.

I can't use the rows in file order to build sequences, because each sequence is only meaningful within the context of its own stock, so I need to select each stock's rows and build the sequences from those.

I have something like this:

from keras.preprocessing.sequence import TimeseriesGenerator
import numpy as np

for stock in stocks:
    # Select this stock's rows and split off the target column
    stock_df = df.loc[df['symbol'] == stock].copy()
    target = stock_df.pop('price')

    x = np.array(stock_df.values)
    y = np.array(target.values)

    sequence = TimeseriesGenerator(x, y, length=4, sampling_rate=1, batch_size=1)

That works fine, but then I want to merge all of those sequences into one bigger one that I will use for training and that contains the data for all the stocks.

It is not possible to use append or merge because the function returns a generator object, not a numpy array.
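If everything does fit in memory, one workaround is to materialize each stock's sequences as plain numpy arrays and concatenate those. The helper below is a minimal sketch (hypothetical name) that mirrors `TimeseriesGenerator`'s windowing with `length=4, sampling_rate=1`: sample `i` covers rows `[i, i + length)` and its target is `y[i + length]`.

```python
import numpy as np

def make_sequences(x, y, length=4):
    # Same windowing as TimeseriesGenerator(x, y, length=length,
    # sampling_rate=1): sample i is x[i:i+length], target is y[i+length].
    n = len(x) - length
    xs = np.array([x[i:i + length] for i in range(n)])
    ys = np.array([y[i + length] for i in range(n)])
    return xs, ys

# Hypothetical per-stock arrays standing in for stock_df.values / target.values.
x1, y1 = np.random.rand(10, 3), np.random.rand(10)
x2, y2 = np.random.rand(12, 3), np.random.rand(12)

xs1, ys1 = make_sequences(x1, y1)
xs2, ys2 = make_sequences(x2, y2)

# One training set covering all stocks; no window crosses a stock boundary
# because each stock was windowed separately before concatenation.
all_x = np.concatenate([xs1, xs2])
all_y = np.concatenate([ys1, ys2])
```

Since each stock is windowed on its own before the concatenation, the resulting arrays can be passed straight to `model.fit`.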


Solution

  • EDIT: New answer:


    So what I ended up doing is all the preprocessing manually, saving an .npy file for each stock containing the preprocessed sequences; then, using a manually created generator, I make batches like this:

    class seq_generator():
    
      def __init__(self, list_of_filepaths):
        self.usedDict = dict()
        for path in list_of_filepaths:
          self.usedDict[path] = []
    
      def generate(self):
        while True:
          # Pick a random stock file, then a random sequence index within it
          path = np.random.choice(list(self.usedDict.keys()))
          stock_array = np.load(path)
          random_sequence = np.random.randint(stock_array.shape[0])
          # Only yield indices that haven't been served yet; note this loops
          # forever once every sequence has been used
          if random_sequence not in self.usedDict[path]:
            self.usedDict[path].append(random_sequence)
            yield stock_array[random_sequence, :, :]
    
    train_generator = seq_generator(list_of_filepaths)
    
    # Pass the instance's bound method as a callable (not seq_generator.generate()),
    # and describe the single float32 array each call yields
    train_dataset = tf.data.Dataset.from_generator(train_generator.generate,
                                                   output_types=tf.float32,
                                                   output_shapes=(n_timesteps, n_features))
    
    train_dataset = train_dataset.batch(batch_size)
    

    Where list_of_filepaths is simply a list of paths to preprocessed .npy data.


    This will:

    • Load a random stock's preprocessed .npy data
    • Pick a sequence at random
    • Check if the index of the sequence has already been used in usedDict
    • If not:
      • Append the index of that sequence to usedDict, so that the same data is never fed to the model twice
      • Yield the sequence

    This means that the generator will feed a single unique sequence from a random stock on each call, which lets me use the .from_generator() and .batch() methods of TensorFlow's Dataset type.
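The sampling scheme above can be sketched in plain numpy without TensorFlow (helper name and file layout hypothetical). Unlike the class above, this sketch caches the loaded arrays instead of reloading on every draw, and stops once every sequence has been served rather than spinning forever:

```python
import os
import tempfile
import numpy as np

def unique_sequence_stream(paths):
    # Each .npy file holds an array shaped (n_sequences, n_timesteps,
    # n_features); served indices are remembered per file so no sequence
    # is ever yielded twice.
    arrays = {p: np.load(p) for p in paths}   # cache; the class above reloads per draw
    used = {p: set() for p in paths}
    total = sum(a.shape[0] for a in arrays.values())
    served = 0
    while served < total:                     # terminate when everything is used
        p = np.random.choice(paths)
        idx = np.random.randint(arrays[p].shape[0])
        if idx not in used[p]:
            used[p].add(idx)
            served += 1
            yield arrays[p][idx]

# Write two toy "stock" files, then drain the stream.
tmp = tempfile.mkdtemp()
paths = []
for name, n_seq in [("a.npy", 3), ("b.npy", 5)]:
    path = os.path.join(tmp, name)
    np.save(path, np.random.rand(n_seq, 4, 2))   # (n_sequences, n_timesteps, n_features)
    paths.append(path)

sequences = list(unique_sequence_stream(paths))
```

Draining the stream yields every stored sequence exactly once, in a random order interleaved across stocks.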


    Original answer:

    I think the answer from @TF_Support slightly misses the point. If I understand your question, it's not that you want to train one model per stock; you want one model trained on the entire dataset.

    If you have enough memory you could create the sequences manually and hold the entire dataset in memory. The issue I'm facing is similar; I simply can't hold everything in memory: Creating a TimeseriesGenerator with multiple inputs.

    Instead I'm exploring the possibility of preprocessing all the data for each stock separately, saving it as .npy files, and then using a generator that loads random samples from those .npy files to batch data to the model. I'm not entirely sure how to approach this yet, though.
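The per-stock preprocessing step could look roughly like this (a minimal sketch with hypothetical names): window each stock's feature matrix into an array of shape (n_sequences, n_timesteps, n_features) and save one .npy file per stock for a generator to sample from later.

```python
import os
import tempfile
import numpy as np

def save_stock_sequences(stock_values, out_path, length=4):
    # stock_values: (n_rows, n_features) array for one stock.
    # Every full window of `length` rows becomes one sequence; adjust the
    # range if the final rows must be held back as prediction targets.
    n = len(stock_values) - length + 1
    seqs = np.stack([stock_values[i:i + length] for i in range(n)])
    np.save(out_path, seqs)
    return seqs.shape

tmp = tempfile.mkdtemp()
# Toy stand-ins for each stock's rows after dropping the target column.
stocks = {"AAA": np.random.rand(10, 3), "BBB": np.random.rand(8, 3)}
shapes = {s: save_stock_sequences(v, os.path.join(tmp, s + ".npy"))
          for s, v in stocks.items()}
```

Each saved file is then independently loadable with np.load, which is what lets a generator pick a random stock and a random sequence without touching the other files.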