Search code examples
rdataframedplyr

Rowwise averaging in Dlyr in R with multiple grouping categories


I have random df with likert scale answers. All columns are questions named as q1,q2,q3,...,q6.

I now have a another data frame that gives me the groups of questions. For example, q1,q2,q3 are in group A, q4 in group B, q5 and q6 in group C.

I want to first calculate the mean (rowwise) of each group. For example, in the new resulted data frame I must have column A with the mean (rowwise) of q1,q2,q3.

So I did:


likert_levels <- c(1,2,3,4,5)
set.seed(42)
library(dplyr)
df <-
  tibble(
    "q1" = sample(likert_levels, 150, replace = TRUE),
    "q2" = sample(likert_levels, 150, replace = TRUE, prob = 5:1),
    "q3" = sample(likert_levels, 150, replace = TRUE, prob = 1:5),
    "q4" = sample(likert_levels, 150, replace = TRUE, prob = 1:5),
    "q5" = sample(c(likert_levels, NA), 150, replace = TRUE),
    "q6" = sample(likert_levels, 150, replace = TRUE, prob = c(1, 0, 1, 1, 0))
  ) %>%
  mutate(across(everything(), ~ factor(.x, levels = likert_levels)))

df
df2 = tibble(categories = c("A","A","A","B","C","C"),
             questions = c("q1","q2","q3","q4","q5","q6"))

df2
df%>%
  mutate(id = row_number())%>%
  tidyr::pivot_longer(!id,names_to = "questions",values_to = "responses")%>%
  left_join(.,df2,by="questions")



df_cor=df%>%
  mutate_if(is.factor,as.double)%>%
  rowwise() %>%
  mutate(QA = mean(c(q1, q2, q3),na.rm=TRUE),
         QB = mean(c(q4),na.rm=TRUE),
         QC = mean(c(q5, q6),na.rm=TRUE))%>%
  select(QA,QB,QC)
df_cor

My question is: because my real life data set contains 100 questions and more than 20 groups, how can I avoid typing in the rowwise mean mutation in dplyr and instead have it be done automatically with a different style of grouping?


Solution

  • You already have the first part of the solution i.e to bring the data in long format. Continue using the long format and calculate the mean for each id and category. If needed you can transform the data in wide format again.

    library(dplyr)
    library(tidyr)
    
    df %>%
      mutate(id = row_number())%>%
      pivot_longer(!id,names_to = "questions",values_to = "responses")%>%
      left_join(df2,by="questions") %>%
      summarise(value = mean(as.integer(as.character(responses)), na.rm = TRUE), 
               .by = c(id, categories)) %>%
      pivot_wider(names_from = categories, names_prefix = "Q", values_from = value)%>%
      select(-id)
    
    # A tibble: 150 × 3
    #      QA    QB    QC
    #   <dbl> <dbl> <dbl>
    # 1  2.33     2   3  
    # 2  3.33     5   3  
    # 3  2        3   2.5
    # 4  2.33     3   4.5
    # 5  3.67     4   3  
    # 6  3.33     5   3  
    # 7  2.67     5   3  
    # 8  2.33     4   2.5
    # 9  2.67     4   4  
    #10  3.67     5   2.5
    # ℹ 140 more rows
    # ℹ Use `print(n = ...)` to see more rows