Suppose I have a pandas DataFrame:
A
B
C
A
B
...and a second dataframe
A 3
A 2
A 4
B 5
B 2
B 8
B 7
C 1
C 5
I want to join the second dataframe to the first, but for each value in the first frame, the joined value should be a random selection from the second column of the second dataframe, picking only from rows where the first column has the same value.
So, for example, for the first value A in the first dataframe, I'd look in the second table and pick randomly from the values in the second column whose first-column value is A, i.e. randomly select one of 3, 2 or 4. For the second value B, I'd pick randomly from 5, 2, 8 or 7. As the end result I'd simply want a dataframe like:
A 2
B 8
C 1
B 7
A 4
Let's say this is what you're starting with:
df1
label
0 A
1 B
2 C
3 A
4 B
df2
label value
0 C 5
1 B 8
2 C 1
3 B 2
4 A 3
5 A 4
6 B 5
7 A 2
8 B 7
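To follow along, the two frames shown above can be built like this. This is only a minimal sketch: the column names label and value are taken from the printed output, and the default integer index is assumed.

```python
import pandas as pd

# First frame: one label per row, in the order they should appear in the result
df1 = pd.DataFrame({'label': ['A', 'B', 'C', 'A', 'B']})

# Second frame: the pool of candidate values for each label
df2 = pd.DataFrame({
    'label': ['C', 'B', 'C', 'B', 'A', 'A', 'B', 'A', 'B'],
    'value': [5, 8, 1, 2, 3, 4, 5, 2, 7],
})
```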
Option 1: Merge on cumcounted key
One easy way to do this is to shuffle df2, add an incremental key to both DataFrames, and then merge:
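# number each occurrence of a label in df1: 0, 1, 2, ...
df3 = df1.assign(key=df1.groupby('label').cumcount())
# shuffle df2, then number each occurrence of a label the same way
df4 = (df2.sample(frac=1)
          .reset_index(drop=True)
          .assign(key=lambda d: d.groupby('label').cumcount()))
# the n-th occurrence of a label in df1 gets the n-th shuffled row for that label
df3.merge(df4, how='left', on=['label', 'key']).drop(columns='key')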
label value
0 A 2
1 B 5
2 C 1
3 A 3
4 B 8
Note: set np.random.seed (or pass random_state to sample) for deterministic shuffles.
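For a reproducible result, one possibility is to pass random_state to sample rather than seeding globally. The sketch below wraps Option 1 in a function; random_join and its seed parameter are hypothetical names chosen for illustration, not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({'label': ['A', 'B', 'C', 'A', 'B']})
df2 = pd.DataFrame({
    'label': ['C', 'B', 'C', 'B', 'A', 'A', 'B', 'A', 'B'],
    'value': [5, 8, 1, 2, 3, 4, 5, 2, 7],
})

def random_join(left, right, seed=None):
    """Hypothetical helper: attach a random row of `right` to each row of `left`."""
    # Number each occurrence of a label in the left frame: 0, 1, 2, ...
    left_keyed = left.assign(key=left.groupby('label').cumcount())
    # Shuffle the right frame reproducibly, then number occurrences the same way
    right_keyed = (right.sample(frac=1, random_state=seed)
                        .reset_index(drop=True)
                        .assign(key=lambda d: d.groupby('label').cumcount()))
    # A left merge keeps the row order of the left frame
    return (left_keyed.merge(right_keyed, how='left', on=['label', 'key'])
                      .drop(columns='key'))

result = random_join(df1, df2, seed=0)
print(result)
```

Because each shuffled row is matched at most once, values are drawn without replacement: if a label occurs more often in df1 than in df2, the extra rows get NaN.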
Option 2: Sample groups and concat
Another option is to group df2 by label, sample from each group as many rows as that label occurs in df1, and then concat:
counts = df1['label'].value_counts()
pd.concat([g.sample(n=counts[k]) for k, g in df2.groupby('label')])
label value
7 A 2
5 A 4
3 B 2
6 B 5
2 C 1
The caveat here is that the result comes back grouped by label, so the row order of df1 isn't preserved.
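If the original row order of df1 matters, one possible variation (a sketch, not part of the approach above) is to sample per group as in Option 2 but write the sampled values back into a copy of df1 through a boolean mask. It assumes every label in df1 also occurs in df2, with at least as many rows there as in df1; the seed 0 is arbitrary.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'label': ['A', 'B', 'C', 'A', 'B']})
df2 = pd.DataFrame({
    'label': ['C', 'B', 'C', 'B', 'A', 'A', 'B', 'A', 'B'],
    'value': [5, 8, 1, 2, 3, 4, 5, 2, 7],
})

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

result = df1.copy()
result['value'] = np.nan  # placeholder column, filled in group by group

for label, pool in df2.groupby('label')['value']:
    mask = result['label'] == label
    if mask.any():
        # Draw as many values (without replacement) as df1 has rows for this label
        picked = pool.sample(n=int(mask.sum()), random_state=rng)
        # Write the raw values so that pandas does not try to align on df2's index
        result.loc[mask, 'value'] = picked.to_numpy()

# Assumes every row was filled; a leftover NaN would make this cast fail
result['value'] = result['value'].astype(int)
print(result)
```

The rows stay in df1's order and keep df1's index, at the cost of an explicit Python loop over the groups.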