r group-by mean data-cleaning data-wrangling

How can I use group_by and mutate together to calculate the means of certain columns?

I am trying to calculate the means of four groups. My data frame looks something similar to the following:

Sex <- c("F", "F", "M", "M", "F")
Phenotype <- c(Control, Experimental, Experimental, Control, Control)
MOp_Amygdala <- c("10", "15", "2", "6", "8")
MOp_Thalamus <- c("19", "12", "4", "4", "6")
MOp_Cerebellum <- c("34", "45", "67", "78", "99")
MOq_Cortex <- c("2", "5", "6", "17", "2")
MOq_Striatum  <- c("100", "101", "102", "106", "200")

df <- data.frame(Sex, Phenotype, MOp_Amygdala, MOp_Thalamus, MOp_Cerebellum, MOq_Cortex, MOq_Striatum)

I want to find the mean of the amygdala, Thalamus, and Cerebellum of my four groups: M-Control, M-Experimental, F-Control and F-Experimental.

This I what I have tried so far:

Q1 <- data %>% 
  group_by(Sex, Phenotype)%>%
  select(starts_with("MOp")) %>%
  rowwise() %>%
  mutate(Group_Means = mean(c(MOp_Amygdala, MOp_Thalamus, MOp_Cerebellum))) #redundant

My main issue with the output is that it seems like group_by is not working. Instead of 4 observations (M-Control, M-Experimental, F-Control, and F-Experimental) I end up with 5 observations, one for each sample.

Solution

You can calculate the mean of each group by summarizing the data frame.

I've modified your input data:

Phenotype values are now quoted
Numeric data are no longer wrapped in quotes
I generated some extra rows just for this demonstration

Sex <- rep(c("F", "F", "M", "M", "F"), 5)
Phenotype <- rep(c('Control', 'Experimental', 'Experimental', 'Control', 'Control'), 5)
MOp_Amygdala <- c(10, 15, 2, 6, 8, sample(seq(1,20,1), 20, replace = TRUE))
MOp_Thalamus <- c(19, 12, 4, 4, 6, sample(seq(1,20,1), 20, replace = TRUE))
MOp_Cerebellum <- c(34, 45, 67, 78, 99, sample(seq(20,100,1), 20, replace = TRUE))
MOq_Cortex <- c(2, 5, 6, 17, 2, sample(seq(1,20,1), 20, replace = TRUE))
MOq_Striatum  <- c(100, 101, 102, 106, 200, sample(seq(100,200,1), 20, replace = TRUE))

df <- data.frame(Sex, Phenotype, MOp_Amygdala, MOp_Thalamus, MOp_Cerebellum, MOq_Cortex, MOq_Striatum)

library(tidyverse)

glimpse(df)
#> Rows: 25
#> Columns: 7
#> $ Sex            <chr> "F", "F", "M", "M", "F", "F", "F", "M", "M", "F", "F", …
#> $ Phenotype      <chr> "Control", "Experimental", "Experimental", "Control", "…
#> $ MOp_Amygdala   <dbl> 10, 15, 2, 6, 8, 16, 6, 14, 3, 2, 16, 20, 15, 15, 2, 8,…
#> $ MOp_Thalamus   <dbl> 19, 12, 4, 4, 6, 14, 9, 12, 2, 9, 17, 17, 4, 7, 16, 9, …
#> $ MOp_Cerebellum <dbl> 34, 45, 67, 78, 99, 73, 21, 94, 30, 75, 54, 80, 48, 27,…
#> $ MOq_Cortex     <dbl> 2, 5, 6, 17, 2, 6, 8, 5, 10, 4, 7, 14, 8, 1, 12, 11, 12…
#> $ MOq_Striatum   <dbl> 100, 101, 102, 106, 200, 192, 193, 162, 121, 198, 109, …

Here's one way to calculate averages for each group for these three columns:

df %>%
  summarise(across(starts_with('MOp'), mean),
            .by = c(Sex, Phenotype))
#>   Sex    Phenotype MOp_Amygdala MOp_Thalamus MOp_Cerebellum
#> 1   F      Control          7.9         12.3           61.5
#> 2   F Experimental         14.6         13.2           52.6
#> 3   M Experimental         12.4         10.8           73.6
#> 4   M      Control          7.2          5.4           50.4

^{Created on 2023-07-24 with reprex v2.0.2}