Tags: r, group-by, dplyr, summarize

sd function returns NA when using group_by() and summarise() in dplyr (no NA values in df)


I've got a df with a binary numeric response variable (0 or 1) and several other variables. I am trying to create a table that groups by type (a 3-level variable) and step (7 levels). I want the mean response and standard deviation for each type at each step. The output table should have 21 rows with 4 variables: type, step, mean and sd.

My code looks like this:

data <- data %>% group_by(step, type) %>% summarise(Response = mean(Response), dev = sd(Response))  

The output table has the correct mean values, but returns NA for all sd values. I tried using 'na.rm=TRUE' to remove NA values, but there aren't any in the original df for Response. Any ideas?


Solution

  • The following should work as you expect:

    data <- data %>% group_by(step, type) %>% summarise(Response_mean = mean(Response), dev = sd(Response))  
    

    As mentioned, you are getting NA because you are passing a single value to sd(). The standard deviation of a single value is undefined, so R returns NA.

    That happens because of the order in which summarise() evaluates its arguments: they are evaluated from left to right, and later ones can see columns created by earlier ones. The following part in your code:

    Response = mean(Response)
    

    creates a column named 'Response' in your new table, holding a single value per group: the mean of the 'Response' column in your original data. From that point on, 'Response' refers to this new column rather than the original one. The following part:

    dev = sd(Response)
    

    then tries to calculate the standard deviation of that single value rather than of the original 0/1 responses, so it returns NA.

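    To illustrate, you can try this as well. Response_plus_10 comes out as the group mean plus 10, because 'Response' already refers to the mean at that point: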

    data <- data %>% group_by(step, type) %>% summarise(Response = mean(Response), Response_plus_10 = Response + 10)  
    

    Hope this clarifies the issue.
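
For anyone who wants to reproduce this without the original data, here is a minimal sketch. The `toy` data frame and its values are made up for illustration (2 steps x 2 types rather than 7 x 3), and `.groups = "drop"` assumes dplyr 1.0.0 or later. It shows the NA from a single-value sd(), the original pattern, and two ways around it.

```r
library(dplyr)

# Hypothetical toy data: 2 steps x 2 types, 3 binary responses per group.
toy <- data.frame(
  step     = rep(c(1, 2), each = 6),
  type     = rep(rep(c("a", "b"), each = 3), times = 2),
  Response = c(0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0)
)

# sd() of a single value is NA; this is what the original code ends up computing.
single_sd <- sd(0.5)

# Original pattern: 'Response' is overwritten by its mean before sd() sees it.
broken <- toy %>%
  group_by(step, type) %>%
  summarise(Response = mean(Response), dev = sd(Response), .groups = "drop")

# Fix 1: give the mean a different name so sd() still sees the raw column.
fixed <- toy %>%
  group_by(step, type) %>%
  summarise(Response_mean = mean(Response), dev = sd(Response), .groups = "drop")

# Fix 2: keep the name 'Response' but compute sd() before overwriting it.
reordered <- toy %>%
  group_by(step, type) %>%
  summarise(dev = sd(Response), Response = mean(Response), .groups = "drop")

print(broken)  # dev is NA in every row
print(fixed)   # dev holds the per-group standard deviation
```

The second fix works only because of argument order, so renaming the mean column is probably the less fragile choice if someone later rearranges the call.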