
Find the difference between strings for each pair of rows in a pandas DataFrame


I am new to Python and have been struggling with this for some time. I have a file that looks like this:

    name   seq
1   a1     bbb
2   a2     bbc
3   b1     fff
4   b2     fff
5   c1     aaa
6   c2     acg

where name is the name of the string and seq is the string itself. I would like a new column (or a new data frame) that gives the number of differences between each pair of consecutive rows, with no overlap between pairs. For example, I want the number of differences between the sequences named [a1-a2], then [b1-b2], and lastly [c1-c2].

So I need something like this:

    name   seq   diff  
1   a1     bbb    NA   
2   a2     bbc    1
3   b1     fff    NA
4   b2     fff    0
5   c1     aaa    NA
6   c2     acg    2

Any help is highly appreciated.


Solution

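  • It looks like you want a distance between each pair of strings. Here's one way using groupby and scipy.spatial.distance.jaccard. Note that the Jaccard distance is normally defined as a ratio between 0 and 1 rather than a count, so the whole-number values printed below may depend on the SciPy and NumPy versions in use; check the result against your expected counts before relying on it: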

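    from scipy.spatial.distance import jaccard

    # Group rows by the first character of name, so a1/a2, b1/b2, c1/c2 pair up
    g = df.groupby(df.name.str[0])

    # Per group (assumed to hold exactly two rows): NaN for the first row,
    # then the distance between the two sequences for the second row
    df['diff'] = [sim for _, seqs in g.seq for sim in
                  [float('nan'), jaccard(*map(list,seqs))]]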
    

    print(df)
    
      name  seq  diff
    1   a1  bbb   NaN
    2   a2  bbc   1.0
    3   b1  fff   NaN
    4   b2  fff   0.0
    5   c1  aaa   NaN
    6   c2  acg   2.0
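
If a plain count of differing positions is what you need, here is a minimal sketch that avoids SciPy altogether. It is not the approach above but a swapped-in alternative: a position-by-position mismatch count (a Hamming-style distance). It assumes the rows come in consecutive pairs, as in the question, and that both sequences in a pair have the same length. The helper name count_mismatches is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question (index 1..6)
df = pd.DataFrame(
    {"name": ["a1", "a2", "b1", "b2", "c1", "c2"],
     "seq": ["bbb", "bbc", "fff", "fff", "aaa", "acg"]},
    index=range(1, 7),
)

def count_mismatches(s1, s2):
    """Count positions where two equal-length strings differ (hypothetical helper)."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s1, s2))

# Assumption: rows form consecutive, non-overlapping pairs (1-2, 3-4, 5-6)
seqs = df["seq"].tolist()
diffs = [float("nan")] * len(seqs)   # first row of each pair stays NaN
for pos in range(1, len(seqs), 2):   # second row of each pair
    diffs[pos] = count_mismatches(seqs[pos - 1], seqs[pos])

df["diff"] = diffs
print(df)
```

With the sample data, the diff column holds NaN, 1, NaN, 0, NaN, 2 (stored as floats because of the NaN entries), which matches the output requested in the question. If the pairs are not guaranteed to be adjacent, grouping by a key such as df.name.str[0] first, as in the answer above, would be the safer design.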