
Pandas: Randomly sample 5 consecutive rows based on a condition (value in another column)


My data is a time series, and I want to draw 5 samples from it, each made up of 5 consecutive days. Within each 5-day sample, the value in another column (the agent) must be the same. Here's a sample of the data:

[Image: sample of the data]

Previously, when I was happy with non-consecutive days, I'd use the following code:

df.groupby("Agent").sample(n=5, random_state=1, replace=True)

I want the starting point to be random, so I don't want to simply take the first row for each agent and then the 4 rows that follow it.


Solution

  • One option is to use a custom groupby.apply:

    import numpy as np
    
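    n = 5
    # randint's upper bound is exclusive, so use len(g)-n+1 to make the last
    # window reachable (assumes every group has at least n rows)
    out = (df.groupby('Agent', group_keys=False)
             .apply(lambda g: g.iloc[(x:=np.random.randint(0, len(g)-n+1)): x+n])
          )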
    

    If you are on Python < 3.8, where the := operator is unavailable, use a named function instead:

    import numpy as np
    
    def random_consecutives(g, n):
        # upper bound is exclusive: +1 lets the last window start at len(g)-n
        start = np.random.randint(0, len(g)-n+1)
        return g.iloc[start: start+n]
    
    out = (df.groupby('Agent', group_keys=False)
             .apply(random_consecutives, n=5)
          )
    

    Example output:

        Agent  Sales (k)        Date
    2       1        1.2  21/08/2012
    3       1        6.7  22/08/2012
    4       1        5.8  23/08/2012
    5       1        9.3  24/08/2012
    6       1        8.3  25/08/2012
    12      2        8.0  06/07/2012
    13      2        0.9  07/07/2012
    14      2        1.3  08/07/2012
    15      2        1.6  09/07/2012
    16      2        8.9  10/07/2012
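
The approach above returns one window per agent. If what you need is closer to the question's wording (5 lots of 5 consecutive rows, each lot within a single agent), a minimal sketch could look like the following. The function name sample_consecutive, the lot column, and the toy data are illustrative assumptions, not part of the original answer. It also assumes the rows are already sorted by date within each agent with one row per day, so that consecutive rows mean consecutive days.

```python
import numpy as np
import pandas as pd

def sample_consecutive(df, key="Agent", n=5, k=5, seed=None):
    """Draw k random windows of n consecutive rows, each inside one group.

    Hypothetical helper: windows may overlap and the same group may be
    picked more than once.
    """
    rng = np.random.default_rng(seed)  # seeded generator for reproducibility
    # Keep only groups long enough to hold a full window.
    groups = [g for _, g in df.groupby(key, sort=False) if len(g) >= n]
    if not groups:
        raise ValueError(f"no group has at least {n} rows")
    windows = []
    for lot in range(k):
        g = groups[rng.integers(len(groups))]    # pick a group at random
        start = rng.integers(0, len(g) - n + 1)  # upper bound is exclusive
        windows.append(g.iloc[start:start + n].assign(lot=lot))
    return pd.concat(windows)

# Toy data (made up for illustration): 3 agents, 8 daily rows each.
df = pd.DataFrame({
    "Agent": np.repeat([1, 2, 3], 8),
    "Sales (k)": np.round(np.random.default_rng(0).uniform(0, 10, 24), 1),
    "Date": pd.date_range("2012-08-01", periods=8).tolist() * 3,
})

out = sample_consecutive(df, n=5, k=5, seed=1)
```

The lot column labels which of the k draws each row belongs to, so out.groupby("lot") gives back the individual 5-day windows. If you would rather not draw the same agent twice, you could sample the groups without replacement instead; that is a design choice the question leaves open.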