I'm working on a Jupyter notebook in Python with Pandas for data analysis.
When I create new aggregation features, I run into a MemoryError because my system does not have enough memory.
Before the error occurs, I run code similar to the following:
# Sum Operations
operation_fields = [column for column in data.columns if 'Operations' in column]
data['Operations'] = data[operation_fields].apply(lambda row: sum(row), axis=1)
for operation_field in operation_fields:
    data[f'Operations_{operation_field}_vs_total'] = np.where(
        data['Operations'] == 0,
        np.floor(data['Operations']),
        np.floor(data[operation_field] / data['Operations'] * 100)
    )
This results in the data having the following dimensions:
After that, I attempt to execute the following code to calculate new features based on transactions:
# Sum transactions
transaction_fields = [column for column in data.columns if 'Transactions' in column and 'SavingAccount' in column]
data['Transactions'] = data[transaction_fields].apply(lambda row: sum(row), axis=1)
for transaction_field in transaction_fields:
    data[f'Transactions_{transaction_fields}_vs_total'] = np.where(
        data['Transactions'] == 0,
        np.floor(data['Transactions']),
        np.floor(data[transaction_fields] / data['Transactions'] * 100)
    )
However, I encounter the error message: "MemoryError: Unable to allocate 325. GiB for an array with shape (208719, 208719) and data type float64."
I'm looking for guidance on how to process this large dataset efficiently.
Options:
The shape in the error message, (208719, 208719), is the giveaway: inside the loop you divide data[transaction_fields] (the whole DataFrame; note that the loop uses transaction_fields, the list, where transaction_field was probably intended) by data['Transactions'] (a Series). Pandas aligns the Series index, i.e. the row labels, against the DataFrame columns, so it tries to build a frame with roughly one column per row, about 208,719 x 208,719 float64 values, which is the 325 GiB it cannot allocate. Use vectorized, row-aligned operations instead, as the small sketch below shows; the fixed Transactions code follows after it.
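For illustration, here is a small self-contained sketch (made-up column names, five rows) that reproduces the shape blow-up and shows how div(..., axis=0) avoids it:

import numpy as np
import pandas as pd

# Toy data with hypothetical column names, just to show the alignment effect
n = 5
df = pd.DataFrame({
    'Transactions_SavingAccount_A': np.arange(n),
    'Transactions_SavingAccount_B': np.arange(n),
})
totals = df.sum(axis=1)

# DataFrame / Series aligns the Series index (row labels) against the
# DataFrame columns, so the result gets one extra column per row label
print((df / totals).shape)           # (5, 7) -- grows towards (n_rows, n_rows) on real data
# Explicit row-wise division keeps the original shape
print(df.div(totals, axis=0).shape)  # (5, 2)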
# Vectorized row sum: pandas computes this in C instead of calling a Python
# lambda once per row, and it avoids building large intermediates
data['Transactions'] = data[transaction_fields].sum(axis=1)

# Column-wise division with explicit row alignment (axis=0) keeps the result
# the same shape as data[transaction_fields]; fillna(0) covers the
# total == 0 case (0/0 -> NaN), matching the np.where branch in your code
transaction_ratios = data[transaction_fields].div(data['Transactions'], axis=0)
transaction_ratios = transaction_ratios.multiply(100).fillna(0).astype(int)

for transaction_field in transaction_fields:
    data[f'Transactions_{transaction_field}_vs_total'] = transaction_ratios[transaction_field]
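The same pattern applies to the Operations features from your first block. A sketch, reusing your operation_fields list (it must be built before the 'Operations' total column is added, otherwise the total itself would match the name filter and be summed in):

data['Operations'] = data[operation_fields].sum(axis=1)

operation_ratios = data[operation_fields].div(data['Operations'], axis=0)
operation_ratios = operation_ratios.multiply(100).fillna(0).astype(int)
for operation_field in operation_fields:
    data[f'Operations_{operation_field}_vs_total'] = operation_ratios[operation_field]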
This keeps memory usage proportional to the size of the data. Alternatively, you could look at Polars, which tends to be faster and more memory-efficient than pandas for this kind of column-wise aggregation; a sketch is below.
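If you want to try Polars, here is a minimal sketch of the Transactions part. The file name data.csv is an assumption, as is loading from CSV at all; adapt it to however you actually read the data:

import polars as pl

df = pl.read_csv('data.csv')  # assumption: the data can be loaded from a CSV

transaction_fields = [c for c in df.columns if 'Transactions' in c and 'SavingAccount' in c]

# Row-wise total across the selected columns
df = df.with_columns(pl.sum_horizontal(transaction_fields).alias('Transactions'))

# Percentage of each field relative to the total; fill_nan/fill_null cover the
# zero-total case so the zero handling matches the pandas version
df = df.with_columns([
    (pl.col(c) / pl.col('Transactions') * 100)
    .floor()
    .fill_nan(0)
    .fill_null(0)
    .alias(f'Transactions_{c}_vs_total')
    for c in transaction_fields
])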