Tags: python-3.x, pandas, random, concatenation

Select random samples within each group according to sample size


I have a very large dataset whose structure is similar to this:

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Group': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    'SampleSize': [4, 4, 4, 4, 4, 4, 1, 1, 1, 1]
})

For example, Group 1 contains 6 different units (IDs) to choose from, and 4 of them need to be chosen to form the sample for that group. In the end I would like an extra column that flags the randomly chosen units, like this:

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Group': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    'SampleSize': [4, 4, 4, 4, 4, 4, 1, 1, 1, 1],
    'Sample': [0, 1, 1, 1, 0, 1, 0, 1, 0, 0]
})

I tried something like this:

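import numpy as np

def select_random_ids(group):
    # SampleSize is constant within a group, so read it from the first row
    sample_size = group['SampleSize'].iloc[0]
    # Draw sample_size distinct IDs from this group
    selected_ids = np.random.choice(group['ID'], size=sample_size, replace=False)
    return pd.DataFrame({'ID': selected_ids})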

and then calling it with .apply(select_random_ids), but I can't get it to work.


Solution

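  • Shuffle the rows with sample(frac=1), compute a cumcount within each Group on the shuffled rows, and compare it to SampleSize. A row is flagged when its shuffled position within the group is below SampleSize. Because gt and the column assignment both align on the index, each flag ends up on its original row: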

    df['Sample'] = df['SampleSize'].gt(df.sample(frac=1).groupby('Group').cumcount()).astype(int)
    

    Example output (the selection is random, so it varies between runs):

       ID  Group  SampleSize  Sample
    0   1      1           4       0
    1   2      1           4       1
    2   3      1           4       0
    3   4      1           4       1
    4   5      1           4       1
    5   6      1           4       1
    6   7      2           1       0
    7   8      2           1       0
    8   9      2           1       0
    9  10      2           1       1
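
The one-liner above can be unpacked into separate steps, which may make it easier to check on a large dataset. This is only a sketch of the same shuffle-and-cumcount idea: the names `shuffled`, `rank`, `picked` and `required` are illustrative, and `random_state=0` is an arbitrary seed added here so that the result is repeatable (it is not part of the original answer).

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Group': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    'SampleSize': [4, 4, 4, 4, 4, 4, 1, 1, 1, 1],
})

# Step 1: shuffle all rows. The seed is an assumption, used only for repeatability.
shuffled = df.sample(frac=1, random_state=0)

# Step 2: number the rows 0, 1, 2, ... within each group, in shuffled order.
rank = shuffled.groupby('Group').cumcount()

# Step 3: flag rows whose shuffled rank is below the group's SampleSize.
# Series.gt and the column assignment align on the index, so the flags
# land on the original (unshuffled) rows.
df['Sample'] = df['SampleSize'].gt(rank).astype(int)

# Sanity check: number of flagged rows per group versus the requested size.
picked = df.groupby('Group')['Sample'].sum()
required = df.groupby('Group')['SampleSize'].first()
print(picked.eq(required).all())
```

The final check compares the number of flagged rows per group with the requested size. If a group has fewer rows than its SampleSize, every row in that group is flagged and the check reports a mismatch, so it also serves as a guard against oversized requests.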