
Calculate Means and SDs for each column using dplyr's "summarise_all"


I have a large dataset with a categorical variable (Size) and a numeric variable (Fraktion) that I use to group the data. The remaining columns are analytical results (numeric values). Every value was sampled 3 times, and I need to compute the mean of those samples.

It looks more or less like this:

Size Fraktion Sample Value1 Value2 ...
A 1 1 3 2 ...
A 1 2 4 4 ...
A 1 3 2 1 ...
A 2 1 1 5 ...
A 2 2 3 7 ...
A 2 3 4 5 ...
B 1 1 2 3 ...
B 1 2 3 2 ...
B 1 3 4 2 ...
B 1 3 2 4 ...

To calculate the means of the samples, I used the summarise function of dplyr like this:

mean_df<-
  df %>%
  group_by(Fraktion,Size)%>%
  summarise_all("mean")

I guess this might not be the most elegant way, as in the next step I have to remove the "Sample" column, but it worked for me. Now I want to integrate the standard deviation for each created mean and add it to the df.

I found this thread (Can I calculate the standard error of all columns with the "summarise_all" function in R dplyr) and tried to use the code provided by Ronak Shah in answer 3:

mean_sd_df<-
  df %>%
  group_by(Fraktion, Size)%>%
  summarise_each(funs(mean,sd,se=sd(.)/sqrt(n())))

However, I get the following error:

across() must only be used inside dplyr verbs.

Any idea what could be the issue?


Solution

  • The classic tidyverse approach is to first make your data tidy by pivoting the table to long format:

    library(dplyr)
    library(tidyr)

    df %>% 
      pivot_longer(starts_with("Value"), names_to = "Measurement", values_to = "val")
    

    and then summarize this long table:

    df %>% 
      pivot_longer(starts_with("Value"), names_to = "Measurement", values_to = "val") %>%
      group_by(Fraktion, Size) %>%
      summarize(mean = mean(val), se = sd(val)/sqrt(n()))
    

    If you want the summary by "ValueX", add the name of the new column to the grouping:

    df %>% 
      pivot_longer(starts_with("Value"), names_to = "Measurement", values_to = "val") %>%
      group_by(Fraktion, Size, Measurement) %>%
      summarize(mean = mean(val), se = sd(val)/sqrt(n()))
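
If you would rather keep the wide layout from the question, a possible alternative is `across()` inside `summarise()`, which replaces the deprecated `summarise_each()` and `funs()`. The sketch below assumes dplyr 1.0.0 or later and uses made-up toy data shaped like the table in the question. It drops the "Sample" column up front and computes the mean, SD and SE for every "Value" column:

```r
library(dplyr)

# Toy data shaped like the table in the question (values are illustrative only).
df <- data.frame(
  Size     = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  Fraktion = c(1, 1, 1, 2, 2, 2, 1, 1, 1),
  Sample   = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Value1   = c(3, 4, 2, 1, 3, 4, 2, 3, 4),
  Value2   = c(2, 4, 1, 5, 7, 5, 3, 2, 2)
)

mean_sd_df <- df %>%
  select(-Sample) %>%                 # the replicate index should not be summarised
  group_by(Fraktion, Size) %>%
  summarise(
    across(
      starts_with("Value"),
      list(
        mean = ~ mean(.x),
        sd   = ~ sd(.x),
        se   = ~ sd(.x) / sqrt(length(.x))  # length(.x) is the group size
      )
    ),
    .groups = "drop"                  # return an ungrouped result
  )

print(mean_sd_df)
```

With the default naming scheme, the result has one row per Fraktion/Size combination and columns such as Value1_mean, Value1_sd and Value1_se. If your real data contains missing values, you would probably want to pass `na.rm = TRUE` to `mean()` and `sd()` and adjust the count used for the SE accordingly.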