r dataframe dplyr bioinformatics summarize

Fast way to summarize a data frame across columns


I have a data.frame (100 rows, 30 columns) in which each cell is one of five possible character states (genotypes):

genotypes <- c("0/0","1/1","0/1","1/0","./.")
library(dplyr)
set.seed(1)
df <- do.call(rbind, lapply(1:100, function(i)
  matrix(sample(genotypes, 30, replace = TRUE), nrow = 1, dimnames = list(NULL, paste0("V", 1:30))))) %>%
  data.frame()

And I want to summarize each row by counting how many cells fall into each category:

  • ref.hom (0/0)
  • alt.hom (1/1)
  • het (0/1 or 1/0)
  • na (./.)

My current approach builds one data.frame per row and seems rather slow:

sum.df <- do.call(rbind,lapply(1:nrow(df), function(i){
  data.frame(ref.hom = length(which(df[i,] == "0/0")),
             alt.hom = length(which(df[i,] == "1/1")),
             het = length(which(df[i,] == "0/1" | df[i,] == "1/0")),
             na = length(which(df[i,] == "./.")))
}))

Is there a more efficient, perhaps dplyr-based, way to do this?


Solution

  • With dplyr, you can try:

    df %>%
     transmute(ref.hom = rowSums(. == "0/0"),
               alt.hom = rowSums(. == "1/1"),
               het = rowSums(. == "0/1") + rowSums(. == "1/0"),
               na = rowSums(. == "./."))
    
        ref.hom alt.hom het na
    1         4      11   9  6
    2         5       2  20  3
    3         3      11  10  6
    4         5       5  15  5
    5         5       4  17  4
    6         3       8  13  6
    7         6       8  11  5
    8         4       8  11  7
    9         6       6  14  4
    10       14       8   5  3
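
The speed-up above comes from comparing the whole table against each genotype in one vectorized pass, rather than building a data.frame per row. A minimal base R sketch of the same idea follows, on a small input that can be checked by hand. The helper name `count_genotypes` and the toy matrix `m` are assumptions made for illustration; they are not part of the original question or answer.

```r
# Toy input (illustrative only): 3 rows x 4 columns of genotype strings.
m <- matrix(c("0/0", "0/1", "1/1", "./.",
              "1/0", "0/1", "0/0", "0/0",
              "./.", "./.", "1/1", "1/1"),
            nrow = 3, byrow = TRUE)

# Hypothetical helper: counts each genotype class per row.
count_genotypes <- function(x) {
  # Convert once to a character matrix so each comparison below
  # is a single vectorized operation over all cells.
  x <- as.matrix(x)
  data.frame(ref.hom = rowSums(x == "0/0"),
             alt.hom = rowSums(x == "1/1"),
             # A cell is heterozygous if it is either "0/1" or "1/0".
             het     = rowSums(x == "0/1" | x == "1/0"),
             na      = rowSums(x == "./."))
}

res <- count_genotypes(m)
print(res)
```

Because every cell belongs to exactly one of the four classes, the counts in each row should add up to the number of columns, which is a cheap sanity check. The same function could be called on the question's `df` as `count_genotypes(df)`.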