I want to sample a Pandas dataframe using values in a certain column, but I want to keep all rows with values that are in the sample.
For example, in the dataframe below I want to randomly sample some fraction of the values in b
, but keep all corresponding rows in a
and c
.
d = pd.DataFrame({'a': range(1, 101, 1),'b': list(range(0, 100, 4))*4, 'c' :list(range(0, 100, 2))*2} )
Desired example output from a 16% sample:
Out[66]:
a b c
0 1 0 0
1 26 0 50
2 51 0 0
3 76 0 50
4 4 12 6
5 29 12 56
6 54 12 6
7 79 12 56
8 18 68 34
9 43 68 84
10 68 68 34
11 93 68 84
12 19 72 36
13 44 72 86
14 69 72 36
15 94 72 86
I've tried sampling the series and merging back to the main data, like this:
In [66]: pd.merge(d, d.b.sample(int(.16 * d.b.nunique())))
This creates the desired output, but it seems inefficient. My real dataset has millions of values in b
and hundreds of millions of rows. I know I could also use some version of ``isin```, but that also is slow.
Is there a more efficient way to do this?
I really doubt that isin
is slow:
uniques = df.b.unique()
# this maybe the bottle neck
samples = np.random.choice(uniques, replace=False, size=int(0.16*len(uniques)) )
# sampling here
df[df.b.isin(samples)]
You can profile the steps above. In case samples=...
is slow, you can try:
idx = np.random.rand(len(uniques))
samples = uniques[idx<0.16]
Those took about 100 ms on my system on 10 million rows.
Note: d.b.sample(int(.16 * d.b.nunique()))
does not sample 0.16
of the unique values in b
.