I was wondering how I could compute mismatching cases by group.
Suppose this is my data:
sek <- rbind(c(1, 'a', 'a', 'a'),
             c(1, 'a', 'a', 'a'),
             c(2, 'b', 'b', 'b'),
             c(2, 'c', 'b', 'b'))
colnames(sek) <- c('Group', paste('t', 1:3, sep = ''))
The data look like this:
Group t1 t2 t3
[1,] "1" "a" "a" "a"
[2,] "1" "a" "a" "a"
[3,] "2" "b" "b" "b"
[4,] "2" "c" "b" "b"
In order to get something like
Group 1 : 0
Group 2 : 1
it would be handy to use the stringdist
package to compute this, with something like:
seqdistgroupStr <- function(x) stringdistmatrix(x, method = 'hamming')

sek %>%
  as.data.frame() %>%
  group_by(Group) %>%
  seqdistgroupStr()
But it does not work.
Any ideas?
Quick update: how would you solve the question of weights? For example, how could I pass a value (1, 2, 3, ...) as the cost of a mismatch between two characters? Say the mismatch between b and c costs 2 while the mismatch between a and c costs 1, and so on.
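One possible approach, sketched below: as far as I know, stringdist's `weight` argument prices operation types (deletion, insertion, substitution, transposition), not specific character pairs, so pair-specific costs need a small helper. The `cost` matrix and the `weighted_hamming` function here are made up for illustration; adjust the costs to your own scheme.

```r
# Hypothetical symmetric cost matrix: mismatch b/c costs 2, a/c costs 1, etc.
chars <- c('a', 'b', 'c')
cost <- matrix(0, 3, 3, dimnames = list(chars, chars))
cost['b', 'c'] <- cost['c', 'b'] <- 2
cost['a', 'c'] <- cost['c', 'a'] <- 1
cost['a', 'b'] <- cost['b', 'a'] <- 1

# Weighted Hamming distance between two equal-length character vectors:
# look up and sum the substitution cost at each position.
weighted_hamming <- function(x, y, cost) {
  stopifnot(length(x) == length(y))
  sum(cost[cbind(x, y)])
}

weighted_hamming(c('a', 'b', 'b'), c('c', 'b', 'b'), cost)  # 1 (a vs c)
weighted_hamming(c('b', 'b', 'b'), c('c', 'b', 'b'), cost)  # 2 (b vs c)
```

You could then apply `weighted_hamming` across the rows of each group instead of `stringdistmatrix`.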
The code below will give you the number of mismatches by group, where a mismatch is defined as one less than the number of unique values in each column t1, t2, etc. for each level of Group. I think you would need to bring in a string distance measure only if you need more than a binary measure of mismatch, but a binary measure suffices for the example you gave. Also, if all you want is the number of distinct rows in each group, then @Alex's solution is more concise.
library(dplyr)
library(reshape2)
sek %>%
  as.data.frame() %>%
  melt(id.var = "Group") %>%
  group_by(Group, variable) %>%
  summarise(mismatch = length(unique(value)) - 1) %>%
  group_by(Group) %>%
  summarise(mismatch = sum(mismatch))
Group mismatch
1 1 0
2 2 1
Here's a shorter dplyr method to count individual mismatches. It doesn't require reshaping, but it requires other data gymnastics:
sek %>%
  as.data.frame() %>%
  group_by(Group) %>%
  # summarise_each()/funs() are deprecated; across() is the current idiom
  summarise(across(everything(), ~ length(unique(.x)) - 1)) %>%
  mutate(mismatch = rowSums(across(starts_with("t")))) %>%
  select(-matches("^t[1-3]$"))
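For reference, the more concise distinct-rows idea mentioned above could be sketched like this (my reconstruction, not @Alex's actual code): a group with k distinct rows contains k - 1 mismatching rows.

```r
library(dplyr)

# Data from the question
sek <- rbind(c(1, 'a', 'a', 'a'),
             c(1, 'a', 'a', 'a'),
             c(2, 'b', 'b', 'b'),
             c(2, 'c', 'b', 'b'))
colnames(sek) <- c('Group', paste('t', 1:3, sep = ''))

# n_distinct() counts unique combinations of t1, t2, t3 within each group
sek %>%
  as.data.frame() %>%
  group_by(Group) %>%
  summarise(mismatch = n_distinct(t1, t2, t3) - 1)
```

This gives 0 for Group 1 and 1 for Group 2, matching the output above.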