My dataframe contains three columns: ID1
, ID2
, count
. For each ID1
value, there are multiple ID2
values sharing the same count number. My goal is to randomly select a unique ID2
value for ID1
, but because some ID1
have the same count, so they select from the same group of ID2
, and each ID1
thus does not match a unique ID2
. How can I remove the previously selected ID2
?
df
index ID1 cnt ID2
96333 48740 142 14145
96334 48740 142 23140
96335 48740 142 26850
96336 48740 142 159025
96349 148240 142 14145
96350 148240 142 23140
96351 148240 142 26850
96352 148240 142 159025
my current code df.groupby("ID1").sample(n = 1, random_state = 5)
created
index ID1 cnt ID2
96333 48740 142 14145
96349 148240 142 14145
expected df
index ID1 cnt ID2
96333 48740 142 14145
96352 148240 142 159025
One solution is that looping through the dataframe, remove the selected rows and using the updated df for the next sampling, but it will take a long time if the df is large.
Thanks!
You can pivot
on ID1/ID2 and get the diagonals. This will ensure unique combinations.
Then apply per cnt if needed:
def unique(df):
idx = np.diagonal(df.pivot('ID1', 'ID2', 'index'))
return df[df['index'].isin(idx)]
out = unique(df)
# or per group
out = df.groupby('cnt', group_keys=False).apply(unique)
Output:
index ID1 cnt ID2
0 96333 48740 142 14145
5 96350 148240 142 23140
NB. with the above function, the output is deterministic as the pivot
sorts the columns. If you want a random output, shuffle the columns before getting the diagonal.
def unique(df):
idx = np.diagonal(df.pivot('ID1', 'ID2', 'index')
.sample(frac=1, axis=1))
return df[df['index'].isin(idx)]