Search code examples
pythonpandasdataframesample

Sample dataframe by value in column and keep all rows


I want to sample a Pandas dataframe using values in a certain column, but I want to keep all rows with values that are in the sample.

For example, in the dataframe below I want to randomly sample some fraction of the values in b, but keep all corresponding rows in a and c.

d = pd.DataFrame({'a': range(1, 101, 1),'b': list(range(0, 100, 4))*4, 'c' :list(range(0, 100, 2))*2} )

Desired example output from a 16% sample:

Out[66]: 
     a   b   c
0    1   0   0
1   26   0  50
2   51   0   0
3   76   0  50
4    4  12   6
5   29  12  56
6   54  12   6
7   79  12  56
8   18  68  34
9   43  68  84
10  68  68  34
11  93  68  84
12  19  72  36
13  44  72  86
14  69  72  36
15  94  72  86

I've tried sampling the series and merging back to the main data, like this:

In [66]: pd.merge(d, d.b.sample(int(.16 * d.b.nunique())))

This creates the desired output, but it seems inefficient. My real dataset has millions of values in b and hundreds of millions of rows. I know I could also use some version of ``isin```, but that also is slow.

Is there a more efficient way to do this?


Solution

  • I really doubt that isin is slow:

    uniques = df.b.unique()
    
    # this maybe the bottle neck
    samples = np.random.choice(uniques, replace=False, size=int(0.16*len(uniques)) )
    
    # sampling here
    df[df.b.isin(samples)]
    

    You can profile the steps above. In case samples=... is slow, you can try:

    idx = np.random.rand(len(uniques))
    samples = uniques[idx<0.16]
    

    Those took about 100 ms on my system on 10 million rows.

    Note: d.b.sample(int(.16 * d.b.nunique())) does not sample 0.16 of the unique values in b.