r, similarity

similarity index in a list of character vectors


I have a list that looks like this one:

$`264`
[1] "CHAMP1" "MAP1S"  "PRRC1"  "TUT1"   "CDK12" 

$`265`
[1] "TUT1"   "PRRC1"  "CHAMP1" "MAP1S"

$`266`
[1] "REPS1"  "CHAMP1" "PRRC1"  "TUT1"   "MAP1S" 

$`267`
[1] "G3BP1"  "TUT1"   "PRRC1"  "CHAMP1" "MAP1S" 

$`268`
[1] "TUT1"   "CHAMP1" "PRRC1"  "MAP1S"  

$`269`
[1] "DDB1"   "CHAMP1" "TUT1"   "PRRC1"  "MAP1S"

Is there any package or function to calculate the similarity among the different list components?

Many thanks


Solution

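  • I'm not aware of any package for this, but the code below implements the metric from your comment: the size of the intersection divided by the size of the union of two vectors (the Jaccard index). It assumes your list is stored in a variable called lst: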

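    # Jaccard similarity between elements x and y of the global list lst
    siml <- function(x, y) {
      length(intersect(lst[[x]], lst[[y]])) / length(union(lst[[x]], lst[[y]]))
    }
    # All ordered index pairs (x, y), including x == y
    z      <- expand.grid(x = seq_along(lst), y = seq_along(lst))
    result <- mapply(siml, z$x, z$y)
    # Reshape the flat vector into an n x n similarity matrix
    dim(result) <- c(length(lst), length(lst))
    round(result, 3)  # rounded for display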
    #       [,1] [,2]  [,3]  [,4] [,5]  [,6]
    # [1,] 1.000  0.8 0.667 0.667  0.8 0.667
    # [2,] 0.800  1.0 0.800 0.800  1.0 0.800
    # [3,] 0.667  0.8 1.000 0.667  0.8 0.667
    # [4,] 0.667  0.8 0.667 1.000  0.8 0.667
    # [5,] 0.800  1.0 0.800 0.800  1.0 0.800
    # [6,] 0.667  0.8 0.667 0.667  0.8 1.000
    

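    A (slightly) more efficient way to do the same thing is to loop over the list elements directly rather than over index pairs. The resulting matrix also keeps the list names as row and column names: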

    result <- sapply(lst, function(x)
      sapply(lst, function(y) length(intersect(x, y)) / length(union(x, y))))
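
Both versions above compute every pair twice, since the metric is symmetric. As a rough sketch (not part of the original answer), each unordered pair can be computed once with combn() and mirrored into a full matrix. The list below is copied from the question; the helper name jaccard and the variable names sim, pairs and vals are made up for this illustration, and the diagonal is set to 1 on the assumption that no vector is empty.

```r
# Sample data copied from the question
lst <- list(
  `264` = c("CHAMP1", "MAP1S", "PRRC1", "TUT1", "CDK12"),
  `265` = c("TUT1", "PRRC1", "CHAMP1", "MAP1S"),
  `266` = c("REPS1", "CHAMP1", "PRRC1", "TUT1", "MAP1S"),
  `267` = c("G3BP1", "TUT1", "PRRC1", "CHAMP1", "MAP1S"),
  `268` = c("TUT1", "CHAMP1", "PRRC1", "MAP1S"),
  `269` = c("DDB1", "CHAMP1", "TUT1", "PRRC1", "MAP1S")
)

# jaccard() is a hypothetical helper, not a function from any package
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

n   <- length(lst)
sim <- diag(1, n)  # assumes every vector is non-empty, so self-similarity is 1
dimnames(sim) <- list(names(lst), names(lst))

# Each column of pairs holds one unordered index pair (i, j) with i < j
pairs <- combn(n, 2)
vals  <- apply(pairs, 2, function(p) jaccard(lst[[p[1]]], lst[[p[2]]]))

sim[t(pairs)]          <- vals  # fill the upper triangle
sim[t(pairs)[, 2:1]]   <- vals  # mirror into the lower triangle
round(sim, 3)
```

For six short vectors the saving is negligible; the approach only starts to matter when the list is long. If a distance rather than a similarity is needed, 1 - sim gives the corresponding Jaccard distance.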