DNA pairwise distances from R matrix


When working with DNA, we often need the triangular p-distance matrix, which contains the proportion of non-identical sites between each pair of sequences. For example, the sequences:

  1. AGGTT
  2. AGCTA
  3. AGGTA

Yields:

      1    2
2   0.4
3   0.2  0.2

The p-distance calculation is available in certain R packages, but suppose I need to use numerical codes (-1,0,1,2) rather than letters (C,T,A,G). How do I generate the triangular p-distance matrix from "my.matrix", built as follows?

# Define DNA matrix dimensions
bp <- 5  # sequence length (number of sites, i.e. columns)
n  <- 3  # number of sequences (rows)
# Build binary indicator matrices
purine <- matrix(sample(0:1, bp * n, replace = TRUE, prob = c(0.5, 0.5)), n, bp)
ketone <- matrix(sample(0:1, bp * n, replace = TRUE, prob = c(0.5, 0.5)), n, bp)
strong <- 1 - abs(purine - ketone)
# Combine the indicators into the numeric code: C = -1, T = 0, A = 1, G = 2
my.matrix <- (purine * strong - ketone) + (purine * ketone - strong) + purine + ketone
my.matrix

Solution

  • I'm not sure what you are doing with my.matrix, but this should work with either characters or numbers:

    x<-c("AGGTT", "AGCTA", "AGGTA")
    y<-do.call("rbind", strsplit(x, "")) 
    y
         [,1] [,2] [,3] [,4] [,5]
    [1,] "A"  "G"  "G"  "T"  "T" 
    [2,] "A"  "G"  "C"  "T"  "A" 
    [3,] "A"  "G"  "G"  "T"  "A" 
    # For each row (sequence), compute the proportion of sites that differ from every sequence
    z <- apply(y, 1, function(s) colMeans(s != t(y)))
    z
         [,1] [,2] [,3]
    [1,]  0.0  0.4  0.2
    [2,]  0.4  0.0  0.2
    [3,]  0.2  0.2  0.0
    

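    You can use lower.tri or upper.tri to extract a triangle if needed. Also, if the apply call looks confusing, it simply applies the same comparison to each of the three rows. For example, comparing the first row against every sequence (shown here with == so that TRUE marks a matching site) gives...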

    y[1,] == t(y)
         [,1]  [,2]  [,3]
    [1,] TRUE  TRUE  TRUE
    [2,] TRUE  TRUE  TRUE
    [3,] TRUE FALSE  TRUE
    [4,] TRUE  TRUE  TRUE
    [5,] TRUE FALSE FALSE
    

    ...and taking the column means of the mismatches (!= instead of ==) returns the first row of the distance matrix:

    colMeans(y[1,] != t(y))
    [1] 0.0 0.4 0.2
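
A minimal sketch of how the same approach might carry over to a numeric matrix and to a triangular result. It assumes the codes map in the order given in the question (C = -1, T = 0, A = 1, G = 2). The hand-built my.matrix below is an illustrative encoding of the three example sequences, not output from the random generator in the question. as.dist() from base R's stats package is used here as one way to keep only the lower triangle; lower.tri() would work as well.

```r
# Illustrative numeric encoding of AGGTT, AGCTA, AGGTA
# (assumed mapping: C = -1, T = 0, A = 1, G = 2)
my.matrix <- rbind(c(1, 2,  2, 0, 0),
                   c(1, 2, -1, 0, 1),
                   c(1, 2,  2, 0, 1))

# Same idea as in the answer: for each row, the proportion of
# sites that differ from every row of the matrix
p <- apply(my.matrix, 1, function(s) colMeans(s != t(my.matrix)))

# Convert the symmetric matrix to a "dist" object,
# which stores and prints only the lower triangle
d <- as.dist(p)
d

# Alternative: blank out the upper triangle and diagonal in place
p.lower <- p
p.lower[upper.tri(p.lower, diag = TRUE)] <- NA
```

Because the comparison only tests whether two entries are equal, it does not matter whether the entries are letters or numbers, so the pairwise values should agree with the character-based result above.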