Search code examples
pythondataframemergefuzzy

substring merge two dataframes of different sizes based on a column


I have 2 dataframes like these:

df1

Alias   seq_RNA
Hsa-Mir-133-P2-v1-5p    AGCTGGTAAAATGGAACCAAATC
Hsa-Mir-143-P1-v2-3p    TTGGTCCCCTTCAACCAGCTGT
Hsa-Mir-183-P1-v1-3p    TTTGGTCCCCTTCAACCAGCTGT
Hsa-Mir-490-3p  CAACCTGGAGGACTCCATGCTGT
Hsa-Mir-499-5p  TTAAGACTTGCAGTGATGTTTA
Hsa-Mir-163-P2-v2-3p    TTGGTCCCCTTCAACCAGCTA
Hsa-Mir-122-as-5p   TTTAGTGTGATAATGGCGTTTG
Hsa-Mir-1-P1-5p ACATACTTCTTTATATGCCCATA

df2

    Alias   MiRBase_ID
    Hsa-Mir-4-P1    hsa-let-133 
    Hsa-Let-7-P1c   hsa-let-7c 
    Hsa-Let-7-P1d   hsa-let-7a-2 
    Hsa-Mir-183-P1  hsa-let-183 
    Hsa-Mir-122-as  hsa-let-122 
    Hsa-Let-7-P2a3  hsa-let-7f-2 

I want to generate a new df3 where if there is a substring match of one of the alias between df1 and df2 it create a new column adding the corresponding MiRBase_ID, like this:

Alias   seq_RNA MiRBase_ID
    Hsa-Mir-133-P2-v1-5p    AGCTGGTAAAATGGAACCAAATC na
    Hsa-Mir-143-P1-v2-3p    TTGGTCCCCTTCAACCAGCTGT na
    Hsa-Mir-183-P1-v1-3p    TTTGGTCCCCTTCAACCAGCTGT hsa-let-183
    Hsa-Mir-490-3p  CAACCTGGAGGACTCCATGCTGT na
    Hsa-Mir-499-5p  TTAAGACTTGCAGTGATGTTTA na
    Hsa-Mir-163-P2-v2-3p    TTGGTCCCCTTCAACCAGCTA na
    Hsa-Mir-122-as-5p   TTTAGTGTGATAATGGCGTTTG hsa-let-122
    Hsa-Mir-1-P1-5p ACATACTTCTTTATATGCCCATA na

So basically when it finds a match or even a partial match (but the best match possible) of one alias in df1 in df2 it takes the MiRBase_ID and put it in the new df3. otherwhise put na if no match is found. there is no order in the 2 dataframes so each of the best match needs to be searched in the whole Alias column and df1 and df2 are of different nr of rows.


Solution

  • Use str.extract to get the substring and merge:

    import re
    pattern = '|'.join(df2['Alias'].map(re.escape))
    
    out = df1.merge(df2.drop(columns='Alias'),
                    left_on=df1['Alias'].str.extract(f'({pattern})', expand=False),
                    right_on=df2['Alias'], how='left').drop(columns='key_0')
    

    Output:

                      Alias                  seq_RNA   MiRBase_ID
    0  Hsa-Mir-133-P2-v1-5p  AGCTGGTAAAATGGAACCAAATC          NaN
    1  Hsa-Mir-143-P1-v2-3p   TTGGTCCCCTTCAACCAGCTGT          NaN
    2  Hsa-Mir-183-P1-v1-3p  TTTGGTCCCCTTCAACCAGCTGT  hsa-let-183
    3        Hsa-Mir-490-3p  CAACCTGGAGGACTCCATGCTGT          NaN
    4        Hsa-Mir-499-5p   TTAAGACTTGCAGTGATGTTTA          NaN
    5  Hsa-Mir-163-P2-v2-3p    TTGGTCCCCTTCAACCAGCTA          NaN
    6     Hsa-Mir-122-as-5p   TTTAGTGTGATAATGGCGTTTG  hsa-let-122
    7       Hsa-Mir-1-P1-5p  ACATACTTCTTTATATGCCCATA          NaN