I am new to Python and have been struggling with this for some time. I have a file that looks like this:
name seq
1 a1 bbb
2 a2 bbc
3 b1 fff
4 b2 fff
5 c1 aaa
6 c2 acg
where name is the name of the string and seq is the string itself. I would like a new column, or a new data frame, that gives the number of differences between each non-overlapping pair of consecutive rows. For example, I want the number of differences between the sequences for [a1-a2], then [b1-b2], and lastly [c1-c2].
So I need something like this:
name seq diff
1 a1 bbb NA
2 a2 bbc 1
3 b1 fff NA
4 b2 fff 0
5 c1 aaa NA
6 c2 acg 2
Any help is highly appreciated.
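It looks like you want the number of positions at which each pair of strings differs (the Hamming distance). Note that scipy.spatial.distance.jaccard returns a normalized dissimilarity between 0 and 1 rather than a count, so here is one way using groupby and a direct comparison of the characters:
# Group rows by the first character of name, so a1/a2, b1/b2 and c1/c2 pair up
g = df.groupby(df.name.str[0])
# First row of each pair gets NaN, the second gets the count of differing positions.
# This assumes the rows are already ordered by group, two rows per group.
df['diff'] = [d for _, seqs in g.seq for d in
              [float('nan'), sum(a != b for a, b in zip(*seqs))]]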
print(df)
name seq diff
1 a1 bbb NaN
2 a2 bbc 1.0
3 b1 fff NaN
4 b2 fff 0.0
5 c1 aaa NaN
6 c2 acg 2.0
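As a self-contained variant that can be run as-is, here is a minimal sketch that counts the differing positions for each pair of rows. The DataFrame construction is reconstructed from the question, and the helper name count_diffs is hypothetical. It pairs rows by position (rows 1-2, 3-4, ...) rather than by name prefix, and it assumes an even number of rows and equal-length strings within each pair.

```python
import pandas as pd

# Sample data reconstructed from the question (index starts at 1, as shown there)
df = pd.DataFrame(
    {
        "name": ["a1", "a2", "b1", "b2", "c1", "c2"],
        "seq": ["bbb", "bbc", "fff", "fff", "aaa", "acg"],
    },
    index=range(1, 7),
)


def count_diffs(s1, s2):
    """Hypothetical helper: count positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(s1, s2))


seqs = df["seq"].tolist()
diffs = []
# Walk the rows two at a time: (row 1, row 2), (row 3, row 4), ...
for first, second in zip(seqs[0::2], seqs[1::2]):
    # The first row of each pair has nothing to compare against yet, so it gets NaN
    diffs.extend([float("nan"), count_diffs(first, second)])

df["diff"] = diffs
print(df)
```

For the sample data this fills diff with NaN, 1, NaN, 0, NaN, 2, which matches the desired output in the question. Pairing by position avoids depending on the naming scheme, but it would silently mis-pair rows if the file were not sorted so that partners are adjacent.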