
Shuffle DataFrame rows


I have the following DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

The DataFrame is read from a CSV file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame's rows so that all Types are mixed. A possible result could be:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

How can I achieve this?


Solution

  • The idiomatic way to do this with pandas is to use the .sample method of your DataFrame to sample all rows without replacement:

    df.sample(frac=1)
    

    The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means to return all rows (in random order).
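As a minimal sketch of the above, the snippet below shuffles a small stand-in for the frame in the question. The six rows are only illustrative, and random_state is an optional argument added here so that the shuffle is reproducible; leave it out to get a different order on every run.

```python
import pandas as pd

# Small stand-in for the frame in the question (values are illustrative).
df = pd.DataFrame({
    "Col1": [1, 4, 7, 10, 13, 16],
    "Col2": [2, 5, 8, 11, 14, 17],
    "Col3": [3, 6, 9, 12, 15, 18],
    "Type": [1, 1, 2, 2, 3, 3],
})

# frac=1 returns every row exactly once, in random order.
# random_state is optional; it only makes the shuffle repeatable.
shuffled = df.sample(frac=1, random_state=42)
print(shuffled)
```

Each row keeps its original index label, so the index of the result is a permutation of the original one.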


    Note: If you wish to shuffle your DataFrame in place and reset the index, you could do, e.g.,

    df = df.sample(frac=1).reset_index(drop=True)
    

    Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
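To illustrate the effect of drop=True, the sketch below compares both variants on a small made-up frame (the column values and random_state=0 are assumptions for the example only).

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]})

shuffled = df.sample(frac=1, random_state=0)

# drop=True discards the old index and renumbers the rows 0..n-1.
dropped = shuffled.reset_index(drop=True)

# The default (drop=False) keeps the old index as a new column named "index".
kept = shuffled.reset_index()

print(list(dropped.columns))  # ['Col1', 'Type']
print(list(kept.columns))     # ['index', 'Col1', 'Type']
```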

    Follow-up note: Although the above operation is not literally in-place (the name df is rebound to a new object, so id(df_old) is not the same as id(df_new)), it does not leave a second copy of the data in memory afterwards: once df is reassigned, the original frame is no longer referenced and its memory can be released. To show that memory usage after the shuffle is essentially unchanged, you could run a simple memory profiler:

    $ python3 -m memory_profiler .\test.py
    Filename: .\test.py
    
    Line #    Mem usage    Increment   Line Contents
    ================================================
         5     68.5 MiB     68.5 MiB   @profile
         6                             def shuffle():
         7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
         8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
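
A similar measurement can be sketched without the memory_profiler package by using the standard-library tracemalloc module. This is not the test.py from the run above: the function name shuffle_and_measure and the much smaller 1000 x 1000 frame are assumptions chosen so the example runs quickly. It reports the memory still held after the shuffle and the peak reached while it ran.

```python
import tracemalloc

import numpy as np
import pandas as pd


def shuffle_and_measure(n_rows=1000, n_cols=1000):
    """Return (shape, bytes held after the shuffle, peak bytes during the run)."""
    tracemalloc.start()
    df = pd.DataFrame(np.random.randn(n_rows, n_cols))
    # Rebinding df drops the last reference to the unshuffled frame.
    df = df.sample(frac=1).reset_index(drop=True)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return df.shape, current, peak


shape, current, peak = shuffle_and_measure()
print(f"shape={shape}, held: {current / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB")
```

The exact numbers depend on the pandas and NumPy versions in use, so none are quoted here.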