Search code examples
pythonpandasdataframedrop

Using df.drop(idx) results to Errorcode: Out of Memory


My original dataframe df_clean has 536000+ records and I want delete records based on certain conditions using grouppby and filter. Here is the code:

df_pairs = df_clean.groupby([df_clean.CustomerID, df_clean.StockCode, df_clean.Quantity.abs()]).filter(lambda x: (len(x.Quantity.abs()) % 2 == 0) and (x.Quantity.sum() == 0))

len(df_pairs) is 4016

Then I took the index:

 idx=df_pairs.index

And use drop function:

df_clean.drop(idx)

But this last code of dropping took too much time and in the end it's like it crashed and gave a whitepage showing Aw, Snap! Something went wrong while displaying this webpage. Error code: Out of Memory. enter image description here

I have already tried reloading, shutting down my kernel and restarting my computer but I still get the same white page.

I also tried an alternative way using .loc and ~

df_clean = df_clean.loc[~((df_clean.groupby([df_clean.CustomerID, df_clean.StockCode, df_clean.Quantity.abs()]).filter(lambda x: (len(x.Quantity.abs()) % 2 == 0) and (x.Quantity.sum() == 0))))]

But it gives me an error:

TypeError                                 Traceback (most recent call last)
C:\Users\MARTIN~1\AppData\Local\Temp/ipykernel_7792/227912236.py in <module>
----> 1 df_clean = df_clean.loc[~((df_clean.groupby([df_clean.CustomerID, df_clean.StockCode, df_clean.Quantity.abs()]).filter(lambda x: (len(x.Quantity.abs()) % 2 == 0) and (x.Quantity.sum() == 0))))]

~\anaconda3\lib\site-packages\pandas\core\generic.py in __invert__(self)
   1530             return self
   1531 
-> 1532         new_data = self._mgr.apply(operator.invert)
   1533         return self._constructor(new_data).__finalize__(self, method="__invert__")
   1534 

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    323             try:
    324                 if callable(f):
--> 325                     applied = b.apply(f, **kwargs)
    326                 else:
    327                     applied = getattr(b, f)(**kwargs)

~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in apply(self, func, **kwargs)
    379         """
    380         with np.errstate(all="ignore"):
--> 381             result = func(self.values, **kwargs)
    382 
    383         return self._split_op_result(result)

TypeError: bad operand type for unary ~: 'DatetimeArray'

Please advise other alternative ways on how I can remove the records I filtered (stored in df_pairs). Any ideas or solutions would be appreciated.

Note: I cannot use isin() or pd.concat then drop_duplicates() because my dataset is a sales transactions history where each record is a line in an invoice. Something like this:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID TotalSales
536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2018-11-29 08:26:00 2.55 17850 15.30
536365 71053 WHITE METAL LANTERN 6 2018-11-29 08:26:00 3.39 17850 20.34
536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2018-11-29 08:26:00 2.75 17850 22.00
536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2018-11-29 08:26:00 3.39 17850 20.34
536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2018-11-29 08:26:00 3.39 17850 20.34

Solution

  • Using drop like that will return ANOTHER dataframe without those rows. You may want to try operating on the original dataframe so that a new one isn't made.

    Instead of:

    df = df.drop(idxs)
    

    do:

    df.drop(idxs, inplace=True)
    

    You're nearly doubling the memory needed until the garbage collector claims the original.