I am trying to merge two dataframes, and for some reason the merge blows up in memory. The two dataframes are relatively large, but nothing out of the ordinary, and I am using a strong machine with 128 GB of RAM.
x = b.merge(a,left_on='id1',right_on='id',how='left')
This is the output:
MemoryError: Unable to allocate 1.58 TiB for an array with shape (216639452968,) and data type int64
I am probably doing something wrong here, but I can't understand how it gets to a 1.6 TB memory requirement.
Here is some info on the dataframes:
print(a.info())
print(a.memory_usage(deep=True))
print(b.info())
print(b.memory_usage(deep=True))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10092079 entries, 0 to 10092078
Data columns (total 2 columns):
 #   Column  Dtype
---  ------  -----
 0   id      object
 1   date    datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 154.0+ MB
None
Index 128
id 654665935
date 80736632
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000000 entries, 0 to 14999999
Data columns (total 2 columns):
 #   Column  Dtype
---  ------  -----
 0   id1     object
 1   id2     object
dtypes: object(2)
memory usage: 228.9+ MB
None
Index 128
id1 965676606
id2 718661312
dtype: int64
A likely cause is that there are multiple rows with the same id in each dataframe (a and b). A merge produces one output row per matching pair of rows, so if id "123" appears 10 times in "a" and 5 times in "b", the resulting dataframe has 50 rows with id "123". The error message is consistent with this: an int64 array of shape (216639452968,) needs 216,639,452,968 × 8 bytes ≈ 1.58 TiB.
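A quick way to confirm is to estimate the merged row count before running the merge. This is a minimal sketch, assuming a and b are the frames from your output:

# a and b are assumed to be the dataframes from the question
# How many rows of a match each key in b; unmatched left rows
# still produce one output row in a left join, hence fillna(1)
counts_a = a['id'].value_counts()
rows_per_b_row = b['id1'].map(counts_a).fillna(1)
print(f"left join would produce ~{int(rows_per_b_row.sum()):,} rows")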
Make sure you don't have duplicate ids in the datasets. If you do, make sure you really need the duplicates (or use groupby('id').agg(...) to collapse them in whatever way makes sense for your data, as in the sketch below).
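For example, one way to deduplicate "a" before the merge; keeping the latest date per id here is an assumption on my part, so adjust the aggregation to your needs:

# Collapse duplicate ids in a, keeping the most recent date per id
# (keeping the max date is just an example aggregation)
a_dedup = a.groupby('id', as_index=False).agg(date=('date', 'max'))

# The merge now produces at most one match per row of b
x = b.merge(a_dedup, left_on='id1', right_on='id', how='left')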
If this doesn't help, please add some information about the number of unique values in the id column of each dataset.