Tags: python, r, sequence, n-gram, protein-database

Using ngram in clustering protein data (ngram.NGram.compare equivalent in R)


I have some sequence data to compare. The expected output is a distance matrix showing how similar each sequence is to every other one. Previously I used ngram.NGram.compare in Python, and now I want to switch to R. I found the ngram and biogram packages, but I was unable to find a function that generates the expected output.

Assume this is the data:

a <- c("ham","bam","comb")

The output should look like this (the distance between each pair of items):

#      ham    bam   comb
#ham    0     0.5   0.83
#bam   0.5     0     0.6
#comb  0.83   0.6     0

This is the equivalent Python code, which returns the distance for each pair of items:

import ngram

a = ["ham", "bam", "comb"]

# Distance = 1 - similarity for each unordered pair (i < j), using 1-grams
[1 - ngram.NGram.compare(a[i], a[j], N=1)
 for i in range(len(a))
 for j in range(i + 1, len(a))]

Solution

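  • You could use stringdistmatrix from the stringdist package. See the stringdist-metrics documentation for the available metrics. With method = "jaccard" and the default q = 1, strings are compared on single characters, which corresponds to N=1 in the Python code.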

    a <- c("ham","bam","comb")
    stringdist::stringdistmatrix(a, a, method = "jaccard")
    
              [,1] [,2]      [,3]
    [1,] 0.0000000  0.5 0.8333333
    [2,] 0.5000000  0.0 0.6000000
    [3,] 0.8333333  0.6 0.0000000
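
As a cross-check on where these numbers come from, here is a minimal pure-Python sketch of the Jaccard distance on character q-grams. It does not use the ngram or stringdist packages. The helper names (qgrams, jaccard_distance, distance_matrix) are made up for illustration, and it assumes q-grams are treated as sets with no padding. Returning 0.0 when both strings are too short to have any q-gram is also just a choice made here.

```python
def qgrams(s, q=1):
    """Return the set of character q-grams in s (no padding; hypothetical helper)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard_distance(a, b, q=1):
    """1 - |A & B| / |A | B|, where A and B are the q-gram sets of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    if not union:
        # Neither string has any q-gram; treating them as identical is an assumption.
        return 0.0
    return 1.0 - len(ga & gb) / len(union)


def distance_matrix(seqs, q=1):
    """Full symmetric matrix of pairwise Jaccard distances."""
    return [[jaccard_distance(x, y, q) for y in seqs] for x in seqs]


seqs = ["ham", "bam", "comb"]
matrix = distance_matrix(seqs)

# Print the matrix with the sequences as row labels
for name, row in zip(seqs, matrix):
    print(f"{name:5s}", "  ".join(f"{d:.2f}" for d in row))
```

For example, "ham" and "bam" share the characters a and m out of the four distinct characters h, a, m, b, so the distance is 1 - 2/4 = 0.5. None of the three example strings repeats a character, so it makes no difference here whether repeated q-grams are counted once or several times. Real protein sequences do repeat residues, so it is worth comparing a few values from both libraries on your own data before relying on the equivalence.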