
Group medians from a data frame using dplyr


Computing medians seems to be a bit of an Achilles heel for R (i.e., there is no data.frame method). What is the least amount of typing needed to get group medians from a data frame using dplyr?

my_data <- structure(list(group = c("Group 1", "Group 1", "Group 1", "Group 1", 
"Group 1", "Group 1", "Group 1", "Group 1", "Group 1", "Group 1", 
"Group 1", "Group 1", "Group 1", "Group 1", "Group 1", "Group 2", 
"Group 2", "Group 2", "Group 2", "Group 2", "Group 2", "Group 2", 
"Group 2", "Group 2", "Group 2", "Group 2", "Group 2", "Group 2", 
"Group 2", "Group 2"), value = c("5", "3", "6", "8", "10", "13", 
"1", "4", "18", "4", "7", "9", "14", "15", "17", "7", "3", "9", 
"10", "33", "15", "18", "6", "20", "30", NA, NA, NA, NA, NA)), .Names = c("group", 
"value"), class = c("tbl_df", "data.frame"), row.names = c(NA, 
-30L))

library(dplyr)  

# groups 1 & 2
my_data_groups_1_and_2 <- my_data[my_data$group %in% c("Group 1", "Group 2"), ]

# compute medians per group
medians <- my_data_groups_1_and_2 %>%
  group_by(group) %>%
  summarize(the_medians = median(value, na.rm = TRUE)) 

Which gives:

Error in summarise_impl(.data, dots) : 
  STRING_ELT() can only be applied to a 'character vector', not a 'double'
In addition: Warning message:
In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA

What is the least-effort workaround here?


Solution

  • As commented by ivyleavedtoadflax, the error occurs because median expects a numeric or logical argument, but your value column is of type character (the quotes around the numbers in the dput output give this away). Here are two simple ways to solve it:

    my_data %>% 
      filter(group %in% c("Group 1", "Group 2")) %>%
      group_by(group) %>%
      summarize(the_medians = median(as.numeric(value), na.rm = TRUE)) 
    

    Or

    my_data %>% 
      filter(group %in% c("Group 1", "Group 2")) %>%
      mutate(value = as.numeric(value))  %>%
      group_by(group) %>%
      summarize(the_medians = median(value, na.rm = TRUE)) 
    

    To check the structure of your data, including the column types, you can use str:

    str(my_data)
    #Classes ‘tbl_df’ and 'data.frame': 30 obs. of  2 variables:
    # $ group: chr  "Group 1" "Group 1" "Group 1" "Group 1" ...
    # $ value: chr  "5" "3" "6" "8" ...
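
One caveat with as.numeric is that it turns any text it cannot parse into NA (with only a warning), and na.rm = TRUE would then drop those rows silently. The sketch below is one way to guard against that; it rebuilds the question's data in a compact form, and the names converted and failed_to_parse are made up for illustration, not part of dplyr.

```r
library(dplyr)

# Compact rebuild of the question's data: 15 rows per group, numbers stored as text.
my_data <- data.frame(
  group = rep(c("Group 1", "Group 2"), each = 15),
  value = c("5", "3", "6", "8", "10", "13", "1", "4", "18", "4", "7", "9",
            "14", "15", "17", "7", "3", "9", "10", "33", "15", "18", "6",
            "20", "30", rep(NA, 5)),
  stringsAsFactors = FALSE
)

# Convert once, then flag entries that were present as text but failed to parse.
# (converted and failed_to_parse are illustrative names.)
converted <- suppressWarnings(as.numeric(my_data$value))
failed_to_parse <- is.na(converted) & !is.na(my_data$value)
stopifnot(!any(failed_to_parse))  # stop early instead of losing rows silently

group_medians <- my_data %>%
  mutate(value = converted) %>%
  group_by(group) %>%
  summarize(the_medians = median(value, na.rm = TRUE))

group_medians$the_medians
# 8.0 12.5 (Group 1, Group 2)
```

If you only need a quick cross-check without dplyr, tapply(converted, my_data$group, median, na.rm = TRUE) should give the same two values.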