r, sequence, dplyr, hamming-distance

R - Compute Mismatch By Group


I was wondering how I could count mismatching cases by group.

Suppose this is my data:

sek = rbind(c(1, 'a', 'a', 'a'), 
        c(1, 'a', 'a', 'a'), 
        c(2, 'b', 'b', 'b'), 
        c(2, 'c', 'b', 'b'))

colnames(sek) <- c('Group', paste('t', 1:3, sep = ''))

The data look like this:

     Group t1  t2  t3 
[1,] "1"   "a" "a" "a"
[2,] "1"   "a" "a" "a"
[3,] "2"   "b" "b" "b"
[4,] "2"   "c" "b" "b"

I would like to get something like this:

Group 1 : 0 
Group 2 : 1 

It would be nice to use the stringdist package to compute this.

Something like

seqdistgroupStr = function(x) stringdistmatrix(x, method = 'hamming')

sek %>% 
  as.data.frame() %>% 
  group_by(Group) %>% 
  seqdistgroupStr() 

But it does not work.

Any ideas?

Quick update: how would you handle weights? For example, how could I pass a cost (1, 2, 3, ...) for the mismatch between two specific characters, so that a mismatch between b and c costs 2 while a mismatch between a and c costs 1, and so on?


Solution

  • The code below gives the number of mismatches by group. Within each level of Group, the mismatch count for a column (t1, t2, etc.) is one less than the number of unique values in that column, and the group total is the sum of these counts over the columns. I think you would need a string distance measure only if you want more than a binary measure of mismatch; a binary measure suffices for the example you gave. Also, if all you want is the number of distinct rows in each group, then @Alex's solution is more concise.

    library(dplyr)
    library(reshape2)
    
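    sek %>% as.data.frame %>%
      # reshape to long format: one row per (Group, column, value)
      melt(id.vars="Group") %>%
      group_by(Group, variable) %>%
      # mismatches in one column of one group = distinct values - 1
      summarise(mismatch = length(unique(value)) - 1) %>%
      group_by(Group) %>%
      # total over the columns t1..t3
      summarise(mismatch = sum(mismatch))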
    
      Group mismatch
    1     1        0
    2     2        1
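For comparison, here is a minimal base R sketch of what a Hamming-style measure would compute: the number of differing positions, summed over every pair of rows within a group. The helper name `pairwise_hamming` is hypothetical and not part of any package. With two rows per group it agrees with the count above, but with three or more rows per group the two measures can differ (a column holding a, a, b has one extra unique value but two mismatching pairs).

```r
# Example data from the question (a character matrix).
sek <- rbind(c(1, 'a', 'a', 'a'),
             c(1, 'a', 'a', 'a'),
             c(2, 'b', 'b', 'b'),
             c(2, 'c', 'b', 'b'))
colnames(sek) <- c('Group', paste('t', 1:3, sep = ''))

# Hypothetical helper: total Hamming distance over all pairs of rows in m.
pairwise_hamming <- function(m) {
  n <- nrow(m)
  if (n < 2) return(0)          # a single row has nothing to compare against
  pairs <- combn(n, 2)          # each column holds one pair of row indices
  sum(apply(pairs, 2, function(p) sum(m[p[1], ] != m[p[2], ])))
}

# Drop the Group column, split the row indices by group, apply the helper.
seqs <- sek[, -1, drop = FALSE]
hamming_by_group <- sapply(
  split(seq_len(nrow(sek)), sek[, 'Group']),
  function(idx) pairwise_hamming(seqs[idx, , drop = FALSE])
)
print(hamming_by_group)
```

The result is a named vector with one entry per group, which for the example data matches the expected output in the question (0 for group 1, 1 for group 2).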
    

    Here's a shorter dplyr method to count individual mismatches. It doesn't require reshaping, but it requires other data gymnastics:

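    sek %>% as.data.frame %>%
      group_by(Group) %>%
      # across() (dplyr >= 1.0) replaces the deprecated summarise_each(funs(...))
      summarise(across(everything(), ~ length(unique(.x)) - 1)) %>%
      # add up the per-column mismatch counts for each group
      mutate(mismatch = rowSums(.[-1])) %>%
      select(-matches("^t[1-3]$"))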
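On the weights question from the update: one possible approach, sketched here in base R, is to look up each pair of characters in a cost matrix instead of counting every mismatch as 1. This is only an illustration under assumptions. The b-c cost of 2 and the a-c cost of 1 come from the question, while the a-b cost of 1 is made up for the sketch, and the names `cost` and `weighted_dist` are hypothetical.

```r
# Example data from the question (a character matrix).
sek <- rbind(c(1, 'a', 'a', 'a'),
             c(1, 'a', 'a', 'a'),
             c(2, 'b', 'b', 'b'),
             c(2, 'c', 'b', 'b'))
colnames(sek) <- c('Group', paste('t', 1:3, sep = ''))

# Symmetric cost matrix indexed by character; the diagonal (a match) costs 0.
states <- c('a', 'b', 'c')
cost <- matrix(0, 3, 3, dimnames = list(states, states))
cost['b', 'c'] <- cost['c', 'b'] <- 2   # from the question
cost['a', 'c'] <- cost['c', 'a'] <- 1   # from the question
cost['a', 'b'] <- cost['b', 'a'] <- 1   # assumption, not given in the question

# Hypothetical helper: weighted distance between two equal-length sequences.
# cbind(x, y) forms (row name, column name) pairs that index the cost matrix.
weighted_dist <- function(x, y, cost) sum(cost[cbind(x, y)])

# Sum the weighted distance over every pair of rows within each group.
weighted_by_group <- sapply(
  split(seq_len(nrow(sek)), sek[, 'Group']),
  function(idx) {
    if (length(idx) < 2) return(0)
    pairs <- combn(idx, 2)              # each column is one pair of row indices
    sum(apply(pairs, 2, function(p) {
      weighted_dist(sek[p[1], -1], sek[p[2], -1], cost)
    }))
  }
)
print(weighted_by_group)
```

With these assumed costs, the single b/c mismatch in group 2 contributes 2 rather than 1, and group 1 stays at 0. Setting every off-diagonal cost to 1 reduces this to the plain pairwise Hamming count.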