python, python-3.x, pandas, dataframe, data-analysis

Process a pandas DataFrame in Chunks or optimize memory usage for aggregation features


I'm working in a Jupyter notebook, using Python and pandas for data analysis.

When creating new aggregation features, I encounter a "MemoryError" due to the memory capacity of my system.

Before the error occurs, I run operations similar to the following code:

import numpy as np

# Sum Operations
operation_fields = [column for column in data.columns if 'Operations' in column]
data['Operations'] = data[operation_fields].apply(lambda row: sum(row), axis=1)
for operation_field in operation_fields:
    data[f'Operations_{operation_field}_vs_total'] = np.where(
        data['Operations'] == 0,
        np.floor(data['Operations']),
        np.floor(data[operation_field] / data['Operations'] * 100)
    )

This results in the data having the following dimensions:

  • Size (total cells, rows × columns): 17741115
  • Columns: 85
  • Rows: 208719

After that, I attempt to execute the following code to calculate new features based on transactions:

# Sum transactions
transaction_fields = [column for column in data.columns if 'Transactions' in column and 'SavingAccount' in column]
data['Transactions'] = data[transaction_fields].apply(lambda row: sum(row), axis=1)
for transaction_field in transaction_fields:
    data[f'Transactions_{transaction_fields}_vs_total'] = np.where(
        data['Transactions'] == 0,
        np.floor(data['Transactions']),
        np.floor(data[transaction_fields] / data['Transactions'] * 100)
    )

However, I encounter the error message: "MemoryError: Unable to allocate 325. GiB for an array with shape (208719, 208719) and data type float64."

I'm looking for guidance on how to process this large dataset efficiently.

Options:

  • A way to process this dataset in smaller "chunks" to avoid memory errors.
  • Strategies to optimize memory usage when working with large Pandas DataFrames for aggregation features.

Solution

  • Vectorize instead of applying a Python function row by row. The huge (208719, 208719) allocation almost certainly comes from data[transaction_fields] / data['Transactions'] inside your loop (note the plural transaction_fields): dividing a DataFrame by a Series aligns the Series index against the DataFrame's columns, which materializes an enormous intermediate. Using .div(..., axis=0) divides row by row instead:

    # vectorized row-wise sum -- far cheaper than apply(lambda row: sum(row), axis=1)
    data['Transactions'] = data[transaction_fields].sum(axis=1)

    # divide each field column by the row total (axis=0 aligns on the row index),
    # scale to a percentage, and fill the 0/0 rows (total == 0) with 0;
    # astype(int) truncates, which matches np.floor for non-negative values
    transaction_ratios = data[transaction_fields].div(data['Transactions'], axis=0)
    transaction_ratios = transaction_ratios.multiply(100).fillna(0).astype(int)

    for transaction_field in transaction_fields:
        data[f'Transactions_{transaction_field}_vs_total'] = transaction_ratios[transaction_field]
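
    If the full dataset still doesn't fit in memory, your first option (processing in chunks) also works here, because every feature above is computed row by row, so the chunks are independent. Below is a minimal sketch, assuming the raw data comes from a CSV at a hypothetical path data.csv and that 50,000-row chunks fit comfortably in memory:

    import pandas as pd

    chunks = []
    for chunk in pd.read_csv('data.csv', chunksize=50_000):  # hypothetical source file
        transaction_fields = [c for c in chunk.columns
                              if 'Transactions' in c and 'SavingAccount' in c]
        chunk['Transactions'] = chunk[transaction_fields].sum(axis=1)
        # row-wise division; fillna(0) covers rows where the total is 0 (0/0 -> NaN)
        ratios = chunk[transaction_fields].div(chunk['Transactions'], axis=0).mul(100).fillna(0)
        for field in transaction_fields:
            chunk[f'Transactions_{field}_vs_total'] = ratios[field]
        chunks.append(chunk)

    data = pd.concat(chunks, ignore_index=True)

    Note that the final concat still materializes the whole result in memory; if even that is too large, write each processed chunk out to disk (for example as Parquet or CSV) instead of concatenating.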
    
    

    This should help. Alternatively, you can explore Polars, which is rapidly gaining traction in the data ecosystem.
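
    For reference, here is a rough Polars sketch of the same features (an illustration only, assuming the data has already been loaded into a Polars DataFrame called df):

    import polars as pl

    transaction_fields = [c for c in df.columns
                          if 'Transactions' in c and 'SavingAccount' in c]

    # horizontal (row-wise) sum across the selected columns
    df = df.with_columns(pl.sum_horizontal(transaction_fields).alias('Transactions'))

    # percentage of each field vs. the total, 0 where the total is 0 (mirrors the np.where)
    df = df.with_columns([
        pl.when(pl.col('Transactions') == 0)
          .then(0.0)
          .otherwise((pl.col(field) / pl.col('Transactions') * 100).floor())
          .alias(f'Transactions_{field}_vs_total')
        for field in transaction_fields
    ])

    Polars evaluates these expressions in parallel and is typically more memory-efficient than chained pandas operations, and a lazy variant via pl.scan_csv can avoid loading the whole file at once.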