Search code examples
rgroup-bymeandata-cleaningdata-wrangling

How can I use group_by and mutate together to calculate the means of certain columns?


I am trying to calculate the means of four groups. My data frame looks something similar to the following:

Sex <- c("F", "F", "M", "M", "F")
Phenotype <- c(Control, Experimental, Experimental, Control, Control)
MOp_Amygdala <- c("10", "15", "2", "6", "8")
MOp_Thalamus <- c("19", "12", "4", "4", "6")
MOp_Cerebellum <- c("34", "45", "67", "78", "99")
MOq_Cortex <- c("2", "5", "6", "17", "2")
MOq_Striatum  <- c("100", "101", "102", "106", "200")

df <- data.frame(Sex, Phenotype, MOp_Amygdala, MOp_Thalamus, MOp_Cerebellum, MOq_Cortex, MOq_Striatum)

I want to find the mean of the amygdala, Thalamus, and Cerebellum of my four groups: M-Control, M-Experimental, F-Control and F-Experimental.

This I what I have tried so far:

Q1 <- data %>% 
  group_by(Sex, Phenotype)%>%
  select(starts_with("MOp")) %>%
  rowwise() %>%
  mutate(Group_Means = mean(c(MOp_Amygdala, MOp_Thalamus, MOp_Cerebellum))) #redundant 

My main issue with the output is that it seems like group_by is not working. Instead of 4 observations (M-Control, M-Experimental, F-Control, and F-Experimental) I end up with 5 observations, one for each sample.


Solution

  • You can calculate the mean of each group by summarizing the data frame.

    I've modified your input data:

    1. Phenotype values are now quoted
    2. Numeric data are no longer wrapped in quotes
    3. I generated some extra rows just for this demonstration
    Sex <- rep(c("F", "F", "M", "M", "F"), 5)
    Phenotype <- rep(c('Control', 'Experimental', 'Experimental', 'Control', 'Control'), 5)
    MOp_Amygdala <- c(10, 15, 2, 6, 8, sample(seq(1,20,1), 20, replace = TRUE))
    MOp_Thalamus <- c(19, 12, 4, 4, 6, sample(seq(1,20,1), 20, replace = TRUE))
    MOp_Cerebellum <- c(34, 45, 67, 78, 99, sample(seq(20,100,1), 20, replace = TRUE))
    MOq_Cortex <- c(2, 5, 6, 17, 2, sample(seq(1,20,1), 20, replace = TRUE))
    MOq_Striatum  <- c(100, 101, 102, 106, 200, sample(seq(100,200,1), 20, replace = TRUE))
    
    df <- data.frame(Sex, Phenotype, MOp_Amygdala, MOp_Thalamus, MOp_Cerebellum, MOq_Cortex, MOq_Striatum)
    
    library(tidyverse)
    
    glimpse(df)
    #> Rows: 25
    #> Columns: 7
    #> $ Sex            <chr> "F", "F", "M", "M", "F", "F", "F", "M", "M", "F", "F", …
    #> $ Phenotype      <chr> "Control", "Experimental", "Experimental", "Control", "…
    #> $ MOp_Amygdala   <dbl> 10, 15, 2, 6, 8, 16, 6, 14, 3, 2, 16, 20, 15, 15, 2, 8,…
    #> $ MOp_Thalamus   <dbl> 19, 12, 4, 4, 6, 14, 9, 12, 2, 9, 17, 17, 4, 7, 16, 9, …
    #> $ MOp_Cerebellum <dbl> 34, 45, 67, 78, 99, 73, 21, 94, 30, 75, 54, 80, 48, 27,…
    #> $ MOq_Cortex     <dbl> 2, 5, 6, 17, 2, 6, 8, 5, 10, 4, 7, 14, 8, 1, 12, 11, 12…
    #> $ MOq_Striatum   <dbl> 100, 101, 102, 106, 200, 192, 193, 162, 121, 198, 109, …
    

    Here's one way to calculate averages for each group for these three columns:

    df %>%
      summarise(across(starts_with('MOp'), mean),
                .by = c(Sex, Phenotype))
    #>   Sex    Phenotype MOp_Amygdala MOp_Thalamus MOp_Cerebellum
    #> 1   F      Control          7.9         12.3           61.5
    #> 2   F Experimental         14.6         13.2           52.6
    #> 3   M Experimental         12.4         10.8           73.6
    #> 4   M      Control          7.2          5.4           50.4
    

    Created on 2023-07-24 with reprex v2.0.2