Search code examples
pythonpandasjoinrandom

Pick random values from a second table based on join in Python / Pandas


Suppose I have a Python dataframe:

A
B
C
A
B

...and a second dataframe

A  3
A  2
A  4
B  5
B  2
B  8
B  7
C  1
C  5

I want to join the second dataframe to the first - but for each value in the first frame, the join should be a random selection from the second row of the second dataframe picking only from where the first column is the same value.

So, for example, for the first value A in the first dataframe, I'd look in the second table and it would pick randomly from the values in the 2nd row whose first row value is an A - i.e. randomly select one of 3, 2 or 4. For the second value B, I'd pick randomly from 5,2,8 or 7. The end result I'd simply want a dataframe like:

A  2
B  8
C  1
B  7
A  4

Solution

  • Let's say this is what you're starting with:

    df1
    
      label
    0     A
    1     B
    2     C
    3     A
    4     B
    
    df2
    
      label  value
    0     C      5
    1     B      8
    2     C      1
    3     B      2
    4     A      3
    5     A      4
    6     B      5
    7     A      2
    8     B      7
    

    Option 1: Merge on cumcounted key
    One easy way to do this is to shuffle df2, add an incremental key to both dataFrames and then merge:

    df3 = df1.assign(key=df1.groupby('label').cumcount())
    df4 = (df2.sample(frac=1)
              .reset_index(drop=True)
              .assign(key=lambda d: d.groupby('label').cumcount()))
    
    df3.merge(df4, how='left', on=['label', 'key']).drop('key', 1)
    
      label  value
    0     A      2
    1     B      5
    2     C      1
    3     A      3
    4     B      8
    

    note: set np.random.seed for deterministic shuffles


    Option 2: Sample groups and concat
    Another option is to groupby df2, sample groups and concat

    counts = df1['label'].value_counts() 
    pd.concat([g.sample(n=counts[k]) for k, g in df2.groupby('label')])
    
      label  value
    7     A      2
    5     A      4
    3     B      2
    6     B      5
    2     C      1
    

    caveat here is that ordering isn't preserved.