Tags: python, pandas, dataframe, kernel, jupyter-lab

Pandas: Kernel unexpectedly dies after attempting join() two large dataframes


I am trying to join two datasets that share the same index by using:

merged_data = df1.join(df2)

However, the kernel keeps dying. I've tried restarting my notebook (JupyterLab), but I suspect the problem is that one of the dataframes is about 2 GB...

About df1

<class 'pandas.core.frame.DataFrame'>
Index: 97812 entries, XXXX to XXXX
Data columns (total 19 columns):
dtypes: float64(2), int64(3), object(14)
memory usage: 14.9+ MB

About df2

<class 'pandas.core.frame.DataFrame'>
Index: 13888745 entries, XXXX to XXXX
Data columns (total 18 columns):
dtypes: int64(16), object(2)
memory usage: 2.0+ GB

How can I make this work?

I do need all the entries and columns. The dataframes don't have any columns in common besides the index.

In case it is worth knowing: I am using a MacBook Pro (Early 2015) with a 2.7 GHz Dual-Core Intel Core i5 processor and 8 GB of 1867 MHz DDR3 memory.


Solution

  • If the issue is indeed your laptop running out of memory (the joined result carries all 37 columns across roughly 14 million rows, which can easily exceed 8 GB during the operation), you could try something like dask.

    You can convert your pandas dataframes into dask dataframes with dask.dataframe.from_pandas, then use the .join method of the dask dataframes just as you would with regular pandas; see the sketch below.
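    A minimal sketch of that approach, assuming df1 and df2 are the frames from the question; the npartitions values are guesses you may need to tune for your data and RAM:

    import dask.dataframe as dd

    # Convert the in-memory pandas DataFrames to dask DataFrames.
    # npartitions controls chunking (values here are assumptions);
    # more partitions mean smaller pieces are processed at a time.
    ddf1 = dd.from_pandas(df1, npartitions=1)
    ddf2 = dd.from_pandas(df2, npartitions=16)

    # Join on the index, same as pandas' DataFrame.join.
    merged = ddf1.join(ddf2)

    # Either write the result to disk in chunks...
    merged.to_parquet("merged_data.parquet")

    # ...or materialize it as a pandas DataFrame (this pulls the whole
    # result into memory, so it may still fail on an 8 GB machine).
    # merged_data = merged.compute()

    Writing the joined result to parquet rather than calling .compute() keeps the peak memory usage bounded by the partition size rather than the full result.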