Search code examples
rdplyrreshapesummarize

Summaries based on one reference column compared to the other columns in dplyr


I want to get sums of a variable depending on the other variable's na or non-na values in R. A working example code is below:

library(dplyr)
df <- data.frame(A = c(1,2,3,NA,4),
                 B = c(NA,2,3,NA,5),
                 C = c(3,4,NA,NA,NA),
                 REF = c(10,20,30,40,50))

df.na <- df %>% mutate_at(vars(-REF),is.na)

sums <- matrix(0,2,3)
row.names(sums) <- c("NON-NA","NA")
colnames(sums) <- c("A","B","C")

for(i in 1:3){
  sums[,i] <- df.na %>% group_by_at(i) %>% summarise(sum=sum(REF)) %>% select(sum) %>% unlist()
}

> sums
         A   B   C
NON-NA 110 100  30
NA      40  50 120

For example, since 4th term in the A column is NA, the corresponding column values are 30 and 10+20+3+50 = 150-30 = 120 in sums object.

My question is how do I get this output without a for loop?


Solution

  • Here is a solution using the pivot_ functions from tidyr. The approach pivots to a longer form so that you can group by column name and whether the column value is NA.

    library(dplyr)
    library(tidyr)
    df %>%
      pivot_longer(cols = c("A", "B", "C")) %>%
      mutate(isna = is.na(value)) %>%
      group_by(name, isna) %>%
      summarize(value = sum(REF)) %>%
      pivot_wider()
    
      isna      A     B     C
      <lgl> <dbl> <dbl> <dbl>
    1 FALSE   110   100    30
    2 TRUE     40    50   120