r dataframe dplyr bioinformatics summarize

Fast way to summarize a data frame across columns

I have this data.frame of five possible character states (genotypes):

genotypes <- c("0/0","1/1","0/1","1/0","./.")
library(dplyr)
set.seed(1)
df <- do.call(rbind, lapply(1:100, function(i)
  matrix(sample(genotypes, 30, replace = TRUE), nrow = 1, dimnames = list(NULL, paste0("V", 1:30))))) %>%
  data.frame()

And I want to summarize each row by counting how many genotypes of each class it contains:

ref.hom (0/0)
alt.hom (1/1)
het (0/1 or 1/0)
na (./.)

My current approach loops over the rows and seems rather slow:

sum.df <- do.call(rbind,lapply(1:nrow(df), function(i){
  data.frame(ref.hom = length(which(df[i,] == "0/0")),
             alt.hom = length(which(df[i,] == "1/1")),
             het = length(which(df[i,] == "0/1" | df[i,] == "1/0")),
             na = length(which(df[i,] == "./.")))
}))

Is there a more efficient way to do this, perhaps with dplyr?

Solution

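With dplyr, you can use transmute() with rowSums(), which counts matches across all columns at once instead of looping over rows: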

df %>%
 transmute(ref.hom = rowSums(. == "0/0"),
           alt.hom = rowSums(. == "1/1"),
           het = rowSums(. == "0/1") + rowSums(. == "1/0"),
           na = rowSums(. == "./."))

    ref.hom alt.hom het na
1         4      11   9  6
2         5       2  20  3
3         3      11  10  6
4         5       5  15  5
5         5       4  17  4
6         3       8  13  6
7         6       8  11  5
8         4       8  11  7
9         6       6  14  4
10       14       8   5  3
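
The same vectorized idea can be sketched in base R alone, which may help if dplyr is not available. This is only an illustration: the helper name count_genotypes and the toy data frame are assumptions made for this example, not part of the original answer. It converts the data frame to a character matrix once and then counts matches per row with rowSums().

```r
# Hypothetical helper (name chosen for this example): count genotype
# classes per row using base R only.
count_genotypes <- function(df) {
  m <- as.matrix(df)  # character matrix, one row per original row
  data.frame(
    ref.hom = rowSums(m == "0/0"),
    alt.hom = rowSums(m == "1/1"),
    het     = rowSums(m == "0/1" | m == "1/0"),  # either het encoding
    na      = rowSums(m == "./.")
  )
}

# Small hand-made input (assumed data, not from the question)
toy <- data.frame(V1 = c("0/0", "0/1"),
                  V2 = c("1/0", "./."),
                  V3 = c("1/1", "0/1"),
                  stringsAsFactors = FALSE)

res <- count_genotypes(toy)
print(res)
```

Calling count_genotypes(df) on the data frame from the question should give the same four columns as the transmute() call. Since every cell falls into exactly one class, each row of the result sums to the number of genotype columns, which makes for a quick sanity check.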