This is a similar question to an unresolved SO question (Can one perform a left join in pandas that selects only the first match on the right?), but slightly more complex and with no obvious workaround. I am hoping that there may be some new functionality or inspiration available.
I have two dataframes and I want to merge them one-to-one, working from the top down, rather than as a Cartesian product. For example:
import pandas as pd
l1=['a','b','c','c']
l2=['001','002','003','004']
l3=['a','a','b','b','c','c','c','c']
l4=['006','007','008','009','010','011','012','013']
dfa=pd.DataFrame(list(zip(l1,l2)),columns=['A1','A2'])
dfb=pd.DataFrame(list(zip(l3,l4)),columns=['B1','B2'])
dfa
Out[968]:
A1 A2
0 a 001
1 b 002
2 c 003
3 c 004
dfb
Out[969]:
B1 B2
0 a 006
1 a 007
2 b 008
3 b 009
4 c 010
5 c 011
6 c 012
7 c 013
My desired outcome once I 'merge' on columns A1 and B1 is that I end up with only 4 rows: the first instance of each A1 value joins with the first instance of the matching B1 value, the second instance with the second, and so on:
A1 A2 B2
0 a 001 006
1 b 002 008
2 c 003 010
3 c 004 011
I can't left-merge then groupby and take the first() / head(1), because c in A1 isn't unique. Neither can I drop_duplicates() on B1 in dfb, for the same reason. Beyond that, aside from something complicated using loops, I am out of ideas.
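To make both dead ends concrete, here is a minimal sketch that rebuilds the same two frames (the names naive and first_only are only for illustration, not part of my real code):

```python
import pandas as pd

dfa = pd.DataFrame({'A1': ['a', 'b', 'c', 'c'],
                    'A2': ['001', '002', '003', '004']})
dfb = pd.DataFrame({'B1': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
                    'B2': ['006', '007', '008', '009', '010', '011', '012', '013']})

# A plain left merge pairs every 'c' row in dfa with every 'c' row in dfb.
naive = dfa.merge(dfb, left_on='A1', right_on='B1', how='left')
print(len(naive))       # 12 rows: 1*2 for 'a', 1*2 for 'b', 2*4 for 'c'

# Keeping only the first match per A1 value collapses the two 'c' rows into one.
first_only = naive.groupby('A1', as_index=False).first()
print(len(first_only))  # 3 rows, but 4 are wanted
```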
Has anyone got any profound insights they can share?
Let's try to create a 'key' column with groupby() + cumcount() to track position, then merge on A1, B1 and key:
dfa['key']=dfa.groupby(['A1']).cumcount()
dfb['key']=dfb.groupby(['B1']).cumcount()
df=dfa.merge(dfb,left_on=['A1','key'],right_on=['B1','key']).drop(columns=['B1','key'])
output of df:
A1 A2 B2
0 a 001 006
1 b 002 008
2 c 003 010
3 c 004 011
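
If you would rather not add a 'key' column to the original frames, the approach above can be wrapped in a small function. This is only a sketch: merge_by_occurrence and the temporary '_occurrence' column name are made up for illustration, and it assumes neither frame already has a column called '_occurrence'.

```python
import pandas as pd

def merge_by_occurrence(left, right, left_on, right_on, how='inner'):
    """Match the n-th occurrence of a value in left with the n-th in right."""
    key = '_occurrence'  # hypothetical temporary column name
    # assign() returns copies, so the caller's frames are left untouched.
    left_keyed = left.assign(**{key: left.groupby(left_on).cumcount()})
    right_keyed = right.assign(**{key: right.groupby(right_on).cumcount()})
    merged = left_keyed.merge(right_keyed,
                              left_on=[left_on, key],
                              right_on=[right_on, key],
                              how=how)
    # Drop the helper column and the now-redundant right-hand join column.
    return merged.drop(columns=[right_on, key])

dfa = pd.DataFrame({'A1': ['a', 'b', 'c', 'c'],
                    'A2': ['001', '002', '003', '004']})
dfb = pd.DataFrame({'B1': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
                    'B2': ['006', '007', '008', '009', '010', '011', '012', '013']})

result = merge_by_occurrence(dfa, dfb, 'A1', 'B1')
print(result)
```

Passing how='left' should keep rows of the left frame that have no same-position partner on the right, with NaN in the right-hand columns.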