Search code examples
rdplyraggregatena

R Group By and Sum to Ignore NA


I have a data frame like this:

library(dplyr)

name <- c("Bob", "Bob", "Bob", "Bob", "John", "John", "John")
count <- c(2, 3, 4, 5, 2, 3, 4)
score <- c(5, NA, NA, NA, 3, 4, 2)

my_df <- data.frame(cbind(name, count, score)) %>%
  mutate(count = as.numeric(count),
         score = as.numeric(score))
my_df

  name count score
1  Bob     2     5
2  Bob     3    NA
3  Bob     4    NA
4  Bob     5    NA
5 John     2     3
6 John     3     4
7 John     4     2

Then I create another column by taking the product between count and score:

my_df %>%
  mutate(product = count*score) 

  name count score product
1  Bob     2     5      10
2  Bob     3    NA      NA
3  Bob     4    NA      NA
4  Bob     5    NA      NA
5 John     2     3       6
6 John     3     4      12
7 John     4     2       8

I want to group by name and aggregate for the sum(product)/sum(count) but I want the sum of product column to ignore any NA values in the sum (I did this below) AND I want any associated count values to be ignored in the summation. This is my current solution, but it is not right. Bob's result is calculated as 10/(2+3+4+5) = 0.71 but I want Bob's result to be 10/2 = 5.

my_df %>%
  mutate(product = count*score)
  group_by(name) %>%
  summarize(result = sum(product, na.rm = TRUE)/sum(count))

  name  result
  <chr>  <dbl>
1 Bob    0.714
2 John   2.89

Solution

  • We may need to subset the count by the non-NA values in 'product'

    library(dplyr)
    my_df %>%
       mutate(product = count*score) %>% 
       group_by(name) %>% 
       summarise(result = sum(product, na.rm = TRUE)/sum(count[!is.na(product)]))
    

    -output

    # A tibble: 2 × 2
      name  result
      <chr>  <dbl>
    1 Bob     5   
    2 John    2.89
    

    Or do a filter before the grouping

    my_df %>% 
       filter(complete.cases(score)) %>% 
       group_by(name) %>% 
       summarise(result = sum(score * count)/sum(count))
    # A tibble: 2 × 2
      name  result
      <chr>  <dbl>
    1 Bob     5   
    2 John    2.89