For my data, I want to sample 5 lots of 5 consecutive days. For each 'group' of 5-day samples, I want the value in another column to be the same. My data is a time series. Here's a sample:
Previously, when I was happy with non-consecutive days, I'd use the following code:
df.groupby("Agent").sample(n=5, random_state=1, replace=True)
I want it to be random, so I don't just want to take the index for the first new agent and then the subsequent 4 rows.
One option is to use a custom groupby.apply:
import numpy as np

n = 5
# randint's upper bound is exclusive, so +1 keeps the last valid start reachable
out = (df.groupby('Agent', group_keys=False)
         .apply(lambda g: g.iloc[(x := np.random.randint(0, len(g)-n+1)): x+n])
      )
If you have Python < 3.8 (no walrus operator):
import numpy as np

def random_consecutives(g, n):
    # randint's upper bound is exclusive, so +1 keeps the last valid start reachable
    start = np.random.randint(0, len(g)-n+1)
    return g.iloc[start: start+n]

out = (df.groupby('Agent', group_keys=False)
         .apply(random_consecutives, n=5)
      )
Example output:
    Agent  Sales (k)        Date
2       1        1.2  21/08/2012
3       1        6.7  22/08/2012
4       1        5.8  23/08/2012
5       1        9.3  24/08/2012
6       1        8.3  25/08/2012
12      2        8.0  06/07/2012
13      2        0.9  07/07/2012
14      2        1.3  08/07/2012
15      2        1.6  09/07/2012
16      2        8.9  10/07/2012
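The code above returns one 5-row window per agent. If you need several windows per agent (the "5 lots" in the question), a minimal sketch could look like the following. It assumes each agent's rows are already sorted by date with one row per day, so consecutive rows are consecutive days. The helper name sample_blocks, the block column, and the toy data are made up for illustration, and the windows are drawn independently, so they may overlap.

```python
import numpy as np
import pandas as pd

def sample_blocks(g, n=5, k=5, rng=None):
    """Return k random runs of n consecutive rows from g (runs may overlap)."""
    rng = np.random.default_rng() if rng is None else rng
    if len(g) < n:
        return g.iloc[0:0]  # group too short: contribute no rows
    # Generator.integers excludes the upper bound, hence the +1
    starts = rng.integers(0, len(g) - n + 1, size=k)
    # Label each run so the windows can be told apart afterwards
    blocks = [g.iloc[s:s + n].assign(block=i) for i, s in enumerate(starts)]
    return pd.concat(blocks)

# Toy data (hypothetical): 2 agents, 12 consecutive days each
df = pd.DataFrame({
    "Agent": np.repeat([1, 2], 12),
    "Date": list(pd.date_range("2012-07-01", periods=12)) * 2,
    "Sales (k)": np.arange(24, dtype=float),
})

rng = np.random.default_rng(1)  # seeded for reproducibility
out = pd.concat(
    [sample_blocks(g, n=5, k=5, rng=rng) for _, g in df.groupby("Agent")]
)
```

Looping over the groups and concatenating, rather than using groupby.apply, keeps the Agent column in the result regardless of how your pandas version treats grouping columns inside apply.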