python · tensorflow · keras · deep-learning

How to efficiently convert a large pandas dataframe containing a nested list into Tensorflow dataset?


I am building a neural-network-based product recommender system and am trying to convert my dataset into a format that can be used to train a Keras model. The model is a binary classifier predicting the probability of a user buying a given product.

The dataset is a pandas DataFrame containing 3 columns:

  • 'target_product_embeddings' - the embedding of the target product whose purchase probability I want to predict, stored as a list. Embedding dimension = 1536.
  • 'hist_product_embeddings' - embeddings of the products a user bought in the past: 10 products per user, each with a 1536-dimensional embedding, so dimensions = (10, 1536).
  • 'target' - target variable: 1 if the user purchased the product, 0 otherwise.

The dataset contains ~1.5 million records (one row per user and target product). I am trying to use tf.data.Dataset to create the training dataset, but it takes a very long time to run and never finishes. I was wondering if there is a more efficient way to create the training dataset? I have also tried converting the DataFrame into NumPy arrays, but that used up all available memory and eventually caused the kernel to die.

Below is a code example:

# Import libraries
import pandas as pd
import numpy as np
import random
import tensorflow as tf

# Notebook display settings
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Generate a dummy dataset
# Define number of users (number of rows in the dataframe); my actual dataset has ~1.5 million
number_of_users = 10000
# Product embeddings dimension
embeddings_dim = 1536
# Number of products in user's purchase history
hist_prod_seq_len = 10

df = pd.DataFrame(data = {
    'target_product_embeddings': [[random.uniform(-10, 10) for emb in range(embeddings_dim)] 
                                  for user in range(number_of_users)],
    'hist_product_embeddings': [[[random.uniform(-10, 10) for emb in range(embeddings_dim)] 
                                 for prod in range(hist_prod_seq_len)] for user in range(number_of_users)],
    'target': [np.random.choice([0, 1]) for user in range(number_of_users)]
})

# Inspect data
df.shape
df.head()

# Create tf dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    {'target_product_embeddings': df['target_product_embeddings'].tolist(),
     'hist_product_embeddings': df['hist_product_embeddings'].tolist()
    },
    df['target']
))

Solution

  • Handling large datasets with complex structures like yours in TensorFlow can be challenging and resource-intensive when not approached optimally. Your dataset's complexity, especially with high-dimensional embeddings for many users, can easily lead to excessive memory usage and slow processing: at ~1.5 million rows, the purchase-history embeddings alone amount to roughly 1.5M × 10 × 1536 float32 values ≈ 92 GB, so materializing everything in memory at once is not practical.

    Below are my ideas for handling this complexity.

    1) Use a generator function

    • Instead of loading the full dataset into memory at once, a generator function can yield one record at a time, which is memory-efficient for large datasets.

    2) Use .prefetch() to prepare the next batch while the current one is being processed.

    Based on the ideas above, here is a simple implementation, to the best of my understanding and skills. Please correct it as needed.

    import pandas as pd
    import numpy as np
    import tensorflow as tf

    # Assuming the DataFrame 'df' is already created and populated as per your example

    # Define the generator function: yields one record at a time instead of
    # materializing the whole dataset in memory
    def dataset_generator(df):
        for _, row in df.iterrows():
            target_product_embeddings = np.array(row['target_product_embeddings'], dtype=np.float32)
            hist_product_embeddings = np.array(row['hist_product_embeddings'], dtype=np.float32)
            target = np.array(row['target'], dtype=np.int64)
            yield {'target_product_embeddings': target_product_embeddings,
                   'hist_product_embeddings': hist_product_embeddings}, target

    # Define output types and shapes for the dataset
    output_signature = (
        {'target_product_embeddings': tf.TensorSpec(shape=(1536,), dtype=tf.float32),
         'hist_product_embeddings': tf.TensorSpec(shape=(10, 1536), dtype=tf.float32)},
        tf.TensorSpec(shape=(), dtype=tf.int64)
    )

    # Create the TensorFlow dataset from the generator
    train_dataset = tf.data.Dataset.from_generator(
        lambda: dataset_generator(df),
        output_signature=output_signature
    )

    # Batch and prefetch the dataset for optimal performance
    train_dataset = train_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

    # Example to iterate over the first batch (for demonstration)
    for data, target in train_dataset.take(1):
        print(data['target_product_embeddings'].shape)  # (32, 1536)
        print(data['hist_product_embeddings'].shape)    # (32, 10, 1536)
        print(target.shape)                             # (32,)
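
    To connect this dataset to training, the dict keys yielded by the generator just need to match the names of the model's Input layers; the batched, prefetched dataset can then be passed directly to fit(). Below is a minimal sketch of such a model. The architecture itself (average-pooling the history and concatenating it with the target embedding before a sigmoid output) is only an illustrative assumption, not something taken from the question.

    # Minimal sketch: a Keras model whose input names match the dataset's dict keys.
    # The layer choices below are placeholder assumptions.
    target_in = tf.keras.Input(shape=(1536,), name='target_product_embeddings')
    hist_in = tf.keras.Input(shape=(10, 1536), name='hist_product_embeddings')

    # Pool the purchase history into a single vector and combine it with the target embedding
    hist_pooled = tf.keras.layers.GlobalAveragePooling1D()(hist_in)
    x = tf.keras.layers.Concatenate()([target_in, hist_pooled])
    x = tf.keras.layers.Dense(256, activation='relu')(x)
    output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

    model = tf.keras.Model(inputs=[target_in, hist_in], outputs=output)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Train directly from the tf.data pipeline; records are read lazily via the generator
    model.fit(train_dataset, epochs=1)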