Search code examples
pythonpandasrandomuniquesample

how to group by and sample unique rows that are not repeated in other groups


My dataframe contains three columns: ID1, ID2, count. For each ID1 value, there are multiple ID2 values sharing the same count number. My goal is to randomly select a unique ID2 value for ID1, but because some ID1 have the same count, so they select from the same group of ID2, and each ID1 thus does not match a unique ID2. How can I remove the previously selected ID2?

df

index   ID1     cnt ID2
96333   48740   142 14145
96334   48740   142 23140
96335   48740   142 26850
96336   48740   142 159025
96349   148240  142 14145
96350   148240  142 23140
96351   148240  142 26850
96352   148240  142 159025

my current code df.groupby("ID1").sample(n = 1, random_state = 5) created

index   ID1     cnt ID2
96333   48740   142 14145
96349   148240  142 14145

expected df

index   ID1     cnt ID2
96333   48740   142 14145
96352   148240  142 159025

One solution is that looping through the dataframe, remove the selected rows and using the updated df for the next sampling, but it will take a long time if the df is large.

Thanks!


Solution

  • You can pivot on ID1/ID2 and get the diagonals. This will ensure unique combinations.

    Then apply per cnt if needed:

    def unique(df):
        idx = np.diagonal(df.pivot('ID1', 'ID2', 'index'))
        return df[df['index'].isin(idx)]
    
    out = unique(df)
    
    # or per group
    out = df.groupby('cnt', group_keys=False).apply(unique)
    

    Output:

       index     ID1  cnt    ID2
    0  96333   48740  142  14145
    5  96350  148240  142  23140
    

    NB. with the above function, the output is deterministic as the pivot sorts the columns. If you want a random output, shuffle the columns before getting the diagonal.

    def unique(df):
        idx = np.diagonal(df.pivot('ID1', 'ID2', 'index')
                            .sample(frac=1, axis=1))
        return df[df['index'].isin(idx)]