I have a very large dataset whose structure is similar to this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Group': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    'SampleSize': [4, 4, 4, 4, 4, 4, 1, 1, 1, 1]
})
Meaning that, for example, within Group 1 there are 6 different units to choose from (IDs), and for this Group 1, 4 units need to be chosen to form a sample. Eventually I would like to get an extra column that indicates the randomly chosen samples, like this:
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Group': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    'SampleSize': [4, 4, 4, 4, 4, 4, 1, 1, 1, 1],
    'Sample': [0, 1, 1, 1, 0, 1, 0, 1, 0, 0]
})
I tried something like this:
def select_random_ids(group):
    sample_size = group['SampleSize'].iloc[0]
    selected_ids = np.random.choice(group['ID'], size=sample_size, replace=False)
    return pd.DataFrame({'ID': selected_ids})
and then applied it with .apply(select_random_ids), but I can't get it to work.
Shuffle the input with sample, then compute a cumcount and compare it to SampleSize to identify the desired flags:
df['Sample'] = df['SampleSize'].gt(df.sample(frac=1).groupby('Group').cumcount()).astype(int)
Output:
   ID  Group  SampleSize  Sample
0   1      1           4       0
1   2      1           4       1
2   3      1           4       0
3   4      1           4       1
4   5      1           4       1
5   6      1           4       1
6   7      2           1       0
7   8      2           1       0
8   9      2           1       0
9  10      2           1       1
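
To make it easier to see why the shuffle / cumcount comparison works, here is a minimal step-by-step sketch of the same idea. The intermediate names shuffled and rank, and the random_state=0 seed, are assumptions added for illustration and reproducibility; they are not required by the approach.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Group': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    'SampleSize': [4, 4, 4, 4, 4, 4, 1, 1, 1, 1],
})

# 1. Shuffle all rows. random_state=0 is an arbitrary seed (an assumption,
#    added only so that repeated runs give the same draw).
shuffled = df.sample(frac=1, random_state=0)

# 2. Number the rows within each group in their shuffled order: 0, 1, 2, ...
#    Each row gets a random rank inside its own group.
rank = shuffled.groupby('Group').cumcount()

# 3. Flag a row when its random rank is below the group's SampleSize.
#    Series.gt aligns on the index, so the flags land on the right rows
#    even though rank is in shuffled order.
df['Sample'] = df['SampleSize'].gt(rank).astype(int)

# Sanity check: the number of flagged rows per group.
picked = df.groupby('Group')['Sample'].sum()
print(picked)
```

Because the ranks 0 to n-1 within a group are assigned in random order, exactly SampleSize rows per group end up with a rank below SampleSize. If SampleSize is larger than the number of rows in a group, every row of that group is flagged.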