Tags: python, pandas, sampling

Random sampling pandas based on column values


I have files (A, B, C, etc.), each having 12,000 data points. I have divided each file into batches of 1000 points and computed a value for each batch. So for each file we now have 12 values, which are loaded into a pandas DataFrame (shown below).

    file    value_1     value_2
0   A           1           43
1   A           1           89
2   A           1           22
3   A           1           87
4   A           1           43
5   A           1           89
6   A           1           22
7   A           1           87
8   A           1           43
9   A           1           89
10  A           1           22
11  A           1           87
12  A           1           83
13  B           0           99
14  B           0           23
15  B           0           29
16  B           0           34
17  B           0           99
18  B           0           23
19  B           0           29
20  B           0           34
21  B           0           99
22  B           0           23
23  B           0           29
24  B           0           34
25  C           1           62
-   -           -           -
-   -           -           -

As the next step, I need to randomly select a file and, for that file, randomly select a sequence of 4 consecutive batches for value_1. The latter, I believe, can be done with df.sample(), but I'm not sure how to randomly select the file. I tried to make it work with np.random.choice(data['file'].unique()), but that doesn't seem correct.

Thanks for the help in advance. I'm pretty new to pandas and python in general.


Solution

  • If I understand what you are trying to get at, the following should be of help:

    # Test dataframe
    import numpy as np
    import pandas as pd
    
    
    data = pd.DataFrame({'file': np.repeat(['A', 'B', 'C'], 12),
                         'value_1': np.repeat([1,0,1],12),
                         'value_2': np.random.randint(20, 100, 36)})
    # Select a file
    data1 = data[data.file == np.random.choice(data['file'].unique())].reset_index(drop=True)
    
    # Get a random index from data1
    start_ix = np.random.choice(data1.index[:-3])
    
    # Get the 4-row sequence starting at that index. Slice data1 (whose index was
    # reset to 0..n-1), not data; .loc includes both endpoints, hence +3.
    print(data1.loc[start_ix:start_ix+3])
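The same two steps (pick a file, then pick a run of consecutive batches within it) could also be wrapped in a small helper. This is only a sketch: the function name `sample_file_window` and its parameters `window`, `file_col` and `rng` are made up for illustration, and `value_2` is filled with `np.arange` here just so that consecutive rows are easy to recognise.

```python
import numpy as np
import pandas as pd


def sample_file_window(df, window=4, file_col="file", rng=None):
    """Pick one file at random, then a random run of `window` consecutive rows from it."""
    rng = np.random.default_rng() if rng is None else rng

    # Choose one label uniformly among the distinct values of the file column
    chosen = rng.choice(df[file_col].unique())
    rows = df[df[file_col] == chosen].reset_index(drop=True)
    if len(rows) < window:
        raise ValueError(f"file {chosen!r} has fewer than {window} rows")

    # Valid start positions are 0 .. len(rows) - window, inclusive
    start = rng.integers(0, len(rows) - window + 1)
    return rows.iloc[start:start + window]


# Test dataframe: 3 files with 12 batches each (hypothetical values)
data = pd.DataFrame({
    "file": np.repeat(["A", "B", "C"], 12),
    "value_1": np.repeat([1, 0, 1], 12),
    "value_2": np.arange(36),
})

# Passing a seeded generator makes the draw reproducible
picked = sample_file_window(data, rng=np.random.default_rng(0))
print(picked)
```

Using `.iloc` with an explicit end position avoids the inclusive-endpoint behaviour of `.loc`, and computing the start range from `len(rows)` means the helper also works for files with a different number of batches. Note that `df.sample()` draws rows independently, so by itself it would not return a consecutive sequence.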