Pandas creating new vs. overwriting existing dataframe


I am working with a ~70 GB dataframe of around 7 million rows and 56 columns. I want to subset it to a smaller dataframe by taking 100,000 random rows from the original.

While doing so, I observed very strange behavior. df is my 7-million-row dataframe, which I read into Python from a .parquet file.

I first tried the following:

import pandas as pd
df = df.sample(100000)

However, executing this chunk takes forever. I always interrupted the command after ten minutes, because I am sure drawing random rows from a dataframe can't take that long.

Now if I execute the following chunk, the code runs through in just a few seconds:

import pandas as pd
df2 = df
df = df2.sample(100000)

What is happening there? Why does .sample() take forever in the first attempt but finish in just a few seconds in the second? How could copying the dataframe affect the speed of the computation? df and df2 should be exactly the same object, right? I could of course just continue working with the second version, but I don't want two 70 GB dataframes to be stored in memory.


Solution

  • Here is what I believe is happening.

    In this code

    df = df.sample(100000)
    

    when you assign the sample to the same name as the original dataframe, the reference count of the original dataframe drops to zero, so it is freed immediately. Freeing the dataframe also frees every Python object it contained, one by one, except those still referenced by the 100,000 sampled rows. With a 7-million-row dataframe this could take a while, and that teardown time is what looks like .sample() being slow.

    In this code

    df2 = df
    df = df2.sample(100000)
    

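    after assigning the sample to df, the original dataframe is still referenced by the name df2, so it is not freed. Note that df2 = df does not copy anything; it only binds a second name to the same object, so this does not put a second 70 GB dataframe in memory.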

    The way to verify this is to change your second version to

    df2 = df
    df = df2.sample(100000)
    del df2
    

    Doing del df2 removes the name df2, which drops the reference count of the original dataframe to zero. You should now see this version take about as long as your original code.
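The reference-counting behavior described above can be sketched without pandas. This is only a small-scale illustration, assuming CPython (where an object is freed as soon as its last reference goes away); the Frame class and make_frame helper are hypothetical stand-ins for a large dataframe, not pandas API, and a weak reference is used to observe when the original object disappears.

```python
import random
import time
import weakref


class Frame:
    """Hypothetical stand-in for a DataFrame that holds many Python objects."""

    def __init__(self, rows):
        self.rows = rows

    def sample(self, k):
        # Like DataFrame.sample, return a new object rather than a view.
        return Frame(random.sample(self.rows, k))


def make_frame(n=200_000):
    # Hypothetical helper: build a frame full of individual Python strings.
    return Frame([str(i) for i in range(n)])


# Case 1: rebinding the only name releases the original right away.
df = make_frame()
original = weakref.ref(df)      # observe the object without keeping it alive
df = df.sample(10)
freed_on_rebind = original() is None

# Case 2: a second name keeps the original alive.
df = make_frame()
original = weakref.ref(df)
df2 = df                        # no copy is made: two names, one object
same_object = df2 is df
df = df2.sample(10)
alive_with_df2 = original() is not None

# Dropping the second name triggers the teardown that case 1 paid implicitly.
start = time.perf_counter()
del df2
teardown_seconds = time.perf_counter() - start
freed_after_del = original() is None

print(freed_on_rebind, same_object, alive_with_df2, freed_after_del)
print(f"teardown took {teardown_seconds:.4f} s")
```

On CPython all four flags should be True. The teardown time is tiny at this size, but it grows with the number of Python objects the original holds, which is consistent with the slowdown seen on the 7-million-row dataframe.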