python pandas merge drop-duplicates

Pandas one-to-one row merge, maintaining the structure on the left hand side?


This is similar to an unresolved SO question ("Can one perform a left join in pandas that selects only the first match on the right?"), but slightly more complex and with no obvious workaround. I am hoping that there may be some fresh functionality or inspiration available.

I have two dataframes and I want to merge them in a one-to-one fashion, working from the top down, rather than in a Cartesian manner. For example:

import pandas as pd

l1=['a','b','c','c']
l2=['001','002','003','004']
l3=['a','a','b','b','c','c','c','c']
l4=['006','007','008','009','010','011','012','013']

dfa=pd.DataFrame(list(zip(l1,l2)),columns=['A1','A2'])
dfb=pd.DataFrame(list(zip(l3,l4)),columns=['B1','B2'])

dfa
Out[968]: 
  A1   A2
0  a  001
1  b  002
2  c  003
3  c  004

dfb
Out[969]: 
  B1   B2
0  a  006
1  a  007
2  b  008
3  b  009
4  c  010
5  c  011
6  c  012
7  c  013

My desired outcome once I 'merge' on columns A1 and B1 is that I end up with only 4 rows: the first instance of each A1 value joins with the first instance of the matching B1 value, the second instance with the second, and so on:

  A1   A2   B2
0  a  001  006
1  b  002  008
2  c  003  010
3  c  004  011

I can't left-merge then groupby and take the first() / head(1), because c in A1 isn't unique. Neither can I drop_duplicates() on B1 in dfb, for the same reason. Beyond that, aside from something complicated using loops, I am out of ideas.
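
To make the problem concrete, here is a minimal sketch of what those two attempts produce on the frames above. The names cartesian and first_per_key are only illustrative and do not appear elsewhere in this post.

```python
import pandas as pd

dfa = pd.DataFrame({'A1': ['a', 'b', 'c', 'c'],
                    'A2': ['001', '002', '003', '004']})
dfb = pd.DataFrame({'B1': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
                    'B2': ['006', '007', '008', '009', '010', '011', '012', '013']})

# A plain left merge pairs every 'c' on the left with every 'c' on the right.
cartesian = dfa.merge(dfb, left_on='A1', right_on='B1', how='left')
print(len(cartesian))  # 12 rows: 1*2 (a) + 1*2 (b) + 2*4 (c)

# Taking the first match per A1 value collapses the two 'c' rows into one.
first_per_key = cartesian.groupby('A1').first()
print(len(first_per_key))  # 3 rows, not the 4 wanted
```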

Has anyone got any profound insights they can share?


Solution

  • Create a 'key' column with groupby() + cumcount() that numbers each occurrence of a value within its group, then merge on A1 / B1 together with key:

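    # Number each occurrence of a value within its group: 0, 1, 2, ...
    dfa['key']=dfa.groupby(['A1']).cumcount()
    dfb['key']=dfb.groupby(['B1']).cumcount()
    # Merge on the value plus its occurrence number, then drop the helper columns.
    df=dfa.merge(dfb,left_on=['A1','key'],right_on=['B1','key']).drop(columns=['B1','key'])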
    

    output of df:

       A1   A2      B2
    0   a   001     006
    1   b   002     008
    2   c   003     010
    3   c   004     011
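
The same idea can be wrapped in a small helper, sketched below under a few assumptions. The function name merge_by_occurrence and the helper column name '_occurrence' are hypothetical and not part of pandas. The sketch also assumes the two key columns have different names, as A1 and B1 do here. Unlike the snippet above, it does not add a column to the input frames, and it defaults to how='left' so that left-hand rows without a partner on the right would be kept (with NaN in the right-hand columns) rather than dropped.

```python
import pandas as pd


def merge_by_occurrence(left, right, left_on, right_on, how='left'):
    """Pair the n-th occurrence of each key on the left with the n-th on the right."""
    # Hypothetical helper column name; assumed not to clash with existing columns.
    key = '_occurrence'
    # assign() returns new frames, so the caller's frames are left untouched.
    left_k = left.assign(**{key: left.groupby(left_on).cumcount()})
    right_k = right.assign(**{key: right.groupby(right_on).cumcount()})
    merged = left_k.merge(right_k, left_on=[left_on, key],
                          right_on=[right_on, key], how=how)
    # Remove the right-hand key column and the occurrence counter.
    return merged.drop(columns=[right_on, key])


dfa = pd.DataFrame({'A1': ['a', 'b', 'c', 'c'],
                    'A2': ['001', '002', '003', '004']})
dfb = pd.DataFrame({'B1': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
                    'B2': ['006', '007', '008', '009', '010', '011', '012', '013']})

result = merge_by_occurrence(dfa, dfb, 'A1', 'B1')
print(result)
```

On the example data both the inner merge above and this left merge should give the same four rows; they only differ when a left-hand value has more occurrences than the right-hand side can supply.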