python, python-3.x

memory issue when merging two data frames


I am stuck at the second-to-last statement (the pd.merge call) and have no idea how to fix it. The error is: numpy.core._exceptions.MemoryError: Unable to allocate 58.1 GiB for an array with shape (7791676634,) and data type int64

My thinking was that merging a data frame of ~12 million records with another data frame to add 3-4 more columns should not be a big deal. Please help me out; I am totally stuck here. Thanks.

Select_Emp_df has around 900k records, and Big_df has around 12 million records and 9 columns. I just need to merge the two data frames on a key column, the way VLOOKUP works in Excel.

import pandas as pd

Emp_df = pd.read_csv('New_Employee_df.csv', low_memory = False )

# Append the transactions from three yearly CSV files into one data frame
df2019 = pd.read_csv('U21_02767G - Customer Trade Info2019.csv', low_memory=False)
df2021 = pd.read_csv('U21_02767G - Customer Trade Info2021(TillSep).csv', low_memory=False)
df2020 = pd.read_csv('Newdf2020.csv', low_memory = False)

Big_df = pd.concat([df2019, df2020, df2021], ignore_index=True)

Select_Emp_df = Emp_df[['CUSTKEY','GCIF_GENDER_DSC','SEX']]

Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY')
Big_df.info()  # info is a method; calling it prints the summary itself

Solution

  • Just before Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY'), delete the yearly data frames, which are no longer needed once they have been concatenated into Big_df:

    del df2019
    del df2020
    del df2021
    

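    This should free the memory held by the three source frames, since Big_df already holds a copy of their rows.

    Also drop duplicate keys from the lookup frame before merging. The failed allocation is for 7,791,676,634 int64 values, far more than the ~12 million rows in Big_df, so CUSTKEY is most likely repeated in both frames, and every repeated key multiplies the matching rows: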

    Select_Emp_df = Emp_df[['CUSTKEY','GCIF_GENDER_DSC','SEX']].drop_duplicates(subset=['CUSTKEY'])
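
To illustrate why duplicate keys matter more than the leftover frames, here is a minimal sketch on toy data. The frames big_df and emp_df and their values are hypothetical stand-ins for the question's data, and the AMOUNT column is invented for the example. Using how="left" with validate="many_to_one" is one possible way to get VLOOKUP-like behaviour, not necessarily what the asker ended up using.

```python
import pandas as pd

# Size of the array from the error message: 7,791,676,634 int64 values, 8 bytes each.
print(round(7_791_676_634 * 8 / 2**30, 1))  # 58.1 (GiB)

# Toy stand-ins for Big_df and Emp_df (hypothetical data).
big_df = pd.DataFrame({"CUSTKEY": [1, 1, 2, 3], "AMOUNT": [10, 20, 30, 40]})
emp_df = pd.DataFrame({
    "CUSTKEY": [1, 1, 2],
    "GCIF_GENDER_DSC": ["Male", "Male", "Female"],
    "SEX": ["M", "M", "F"],
})

# With duplicate keys on both sides, each key contributes
# (rows on the left) x (rows on the right) rows to the result.
exploded = pd.merge(big_df, emp_df, on="CUSTKEY")
print(len(exploded))  # 5: key 1 gives 2 x 2 = 4 rows, key 2 gives 1 row

# The inner-merge row count can be estimated without doing the merge.
left_counts = big_df["CUSTKEY"].value_counts()
right_counts = emp_df["CUSTKEY"].value_counts()
expected_rows = int(left_counts.mul(right_counts, fill_value=0).sum())
print(expected_rows)  # 5

# VLOOKUP-style merge: one lookup row per key, every left row kept.
# validate="many_to_one" raises pandas.errors.MergeError if the
# right-hand keys are not unique, instead of silently multiplying rows.
lookup = emp_df.drop_duplicates(subset=["CUSTKEY"])
merged = pd.merge(big_df, lookup, on="CUSTKEY", how="left", validate="many_to_one")
print(len(merged) == len(big_df))  # True
```

Running the value_counts estimate on the real frames before merging shows whether the result would fit in memory. Note that drop_duplicates keeps the first row per CUSTKEY, so it is only safe if the repeated rows carry the same gender values.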