Tags: python, pandas, dataframe, tensorflow, tensorflow-datasets

Speed up pandas iteration with multiple Excel sheets


I have an Excel file with 1000 sheets, and each sheet contains a dataframe. To feed my model with this data, I am trying to convert it into 1000 batches of tensors. Here is my code:

import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.read_excel('file.xlsx', sheet_name=None)
file_names = list(df.keys())

columns = ['A','B','C']
features = []
labels = []
for n in file_names:
  df = pd.read_excel('file.xlsx', sheet_name=n)
  features.append(df[columns].to_numpy())
  labels.append(df['D'].to_numpy())
  
Y = tf.convert_to_tensor(np.stack(labels), dtype=tf.float32)
X = tf.convert_to_tensor(np.stack(features), dtype=tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((X, Y))

My code works fine, but it takes over an hour to iterate over the sheets. I will have more than 1000 batches of data in the future, so splitting the workbook into several thousand CSV files does not seem like a good idea. How can I speed up the process?


Solution

  • You could read your file.xlsx just once: passing sheet_name=None to pd.read_excel loads all worksheets into a dictionary of dataframes, and you can then pull each sheet from that dictionary instead of re-reading the file on every loop iteration:

    import tensorflow as tf
    import pandas as pd
    import numpy as np
    from random import sample
    
    ### Create data
    writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
    for i in range(1000):
      df = pd.DataFrame({'A': [1, i, 1, 2, 9], 'B': [3, 4, i, 1, 4], 'C': [3, 4, 3, i, 4], 'D': [1, 2, 6, 1, 4], 'E': [0, 1, 1, 0, 1]})
      df.to_excel(writer, sheet_name='Sheet'+ str(i))
    
    writer.close()  # ExcelWriter.save() was deprecated in pandas 1.5 and removed in 2.0
    
    df = pd.read_excel('file.xlsx', sheet_name=None)
    file_names = list(df.keys())
    
    columns = ['A','B','C']
    features = []
    labels = []
    for n in file_names:
      temp_df = df[n]  # look up the already-parsed sheet; no file I/O here
      features.append(temp_df[columns].to_numpy())
      labels.append(temp_df['D'].to_numpy())
      
    Y = tf.convert_to_tensor(np.stack(labels), dtype=tf.float32)
    X = tf.convert_to_tensor(np.stack(features), dtype=tf.float32)
    dataset = tf.data.Dataset.from_tensor_slices((X, Y))
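
    To sanity-check the result, you can inspect the dataset before training (a quick sketch; the shapes reflect the 5-row, 3-feature demo sheets created above):

    print(dataset.element_spec)
    # (TensorSpec(shape=(5, 3), dtype=tf.float32, name=None),
    #  TensorSpec(shape=(5,), dtype=tf.float32, name=None))

    batched = dataset.batch(16, drop_remainder=True)
    for x_batch, y_batch in batched.take(1):
        print(x_batch.shape, y_batch.shape)  # (16, 5, 3) (16, 5)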
    

    Furthermore, you could try creating your own custom data generator that retrieves random samples from the already-loaded dictionary, which should also speed things up:

    df = pd.read_excel('file.xlsx', sheet_name=None)
    file_names = list(df.keys())
    columns = ['A','B','C']
    
    def generator_function(samples=64):
        def generator():
            # draw `samples` random sheets (without replacement) on each pass
            for n in sample(file_names, samples):
                temp_df = df[n]
                x = temp_df[columns].to_numpy()
                y = temp_df['D'].to_numpy()
                yield x, y
        return generator
    
    gen = generator_function()
    dataset = tf.data.Dataset.from_generator(
        generator=gen,
        output_types=(np.float32, np.int32),
        output_shapes=((5, 3), (5,))  # per sheet: 5 rows x 3 feature columns, and 5 labels
    )
    batch_size = 16
    dataset = dataset.batch(batch_size, drop_remainder=True)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
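
    As a usage sketch (the `model` below is a hypothetical placeholder, not something defined above): tf.data.Dataset.from_generator re-invokes the generator callable each time iteration starts, so every epoch draws a fresh random sample of 64 sheets. Note that since TensorFlow 2.4, the output_types/output_shapes arguments are deprecated in favor of a single output_signature of tf.TensorSpec objects.

    # Inspect one batch; shapes follow the 5-row demo sheets
    for x_batch, y_batch in dataset.take(1):
        print(x_batch.shape, y_batch.shape)  # (16, 5, 3) (16, 5)

    # model.fit(dataset, epochs=10)  # `model` is a placeholder for your own tf.keras model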