
Pre/post hypothesis testing for multiple groups in long format (R, dplyr)


I have a dataset in long format with multiple groups, and I need to run a pre/post-intervention hypothesis test for each group.

I'm trying to do this by grouping on the group column and running the test on value by time point, but the p-values I get don't make sense: they are all the same. See the example below:

# Load the required library
library(dplyr)

# Set seed for reproducibility
set.seed(123)

# Create a data frame with ids, timepoints, groups, and values
data <- data.frame(
  id = rep(1:10, each = 2),  # Increased sample size
  timepoint = rep(c("before", "after"), times = 100),
  group = rep(c("A", "B", "C", "D", "E"), each = 40),  # Adjusted for larger sample size
  value = rnorm(200)  # Generating random values for illustration
)

# Perform a paired Wilcoxon signed-rank test for each group
result <- data %>%
  group_by(group) %>%
  summarise(
    p_value = wilcox.test(value ~ timepoint, data = ., paired = TRUE)$p.value
  )

# Print the results
print(result)

For example, if I select a single group as below, I get a unique and presumably accurate p-value.

I guess there's some issue with how I am grouping them?

# Perform a paired Wilcoxon signed-rank test for group B only
result <- data %>%
  filter(group=='B') %>%
  summarise(
    p_value = wilcox.test(value ~ timepoint, data = ., paired = TRUE)$p.value
  )

# Print the results
print(result)

Can anyone identify the issue here or suggest a better way to achieve this?


Solution

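  • Inside summarise(), the `.` placeholder refers to the whole data frame that was piped in, and wilcox.test() ignores tibble grouping, so for every group your code actually computes this: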

    wilcox.test(value ~ timepoint, data=data, paired=TRUE)$p.value
    # [1] 0.4340859
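
    A quick way to see this, as a minimal sketch on the question's data (the column names n_dot and n_group are made up for illustration, and the id column is left out because it is not needed here): nrow(.) reports the full 200 rows in every group, whereas n() counts only the rows of the current group.

    ```r
    library(dplyr)

    # Same simulated data as in the question, minus the unused id column
    set.seed(123)
    data <- data.frame(
      timepoint = rep(c("before", "after"), times = 100),
      group = rep(c("A", "B", "C", "D", "E"), each = 40),
      value = rnorm(200)
    )

    # `.` is the magrittr placeholder for the whole piped data frame,
    # so n_dot is 200 in every row, while n() gives the per-group count (40).
    check <- data %>%
      group_by(group) %>%
      summarise(n_dot = nrow(.), n_group = n())
    print(check)
    ```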
    

    Base R

    You can achieve what you want by applying wilcox.test() to the data subsets like this:

    sapply(split(data, ~ group), 
           \(gr) wilcox.test(value ~ timepoint, data=gr, paired=TRUE)$p.value)
    #         A         B         C         D         E 
    # 0.3883762 0.8123550 0.5458755 0.2773552 0.6215134 
    

    dplyr

    We can use group_modify() to iterate over the groups:

    data %>%
      group_by(group) %>%
      group_modify(~ data.frame(
        # .x is the subset of rows for the current group
        p_value = wilcox.test(value ~ timepoint, data=.x, paired=TRUE)$p.value
      ))
    # # A tibble: 5 × 2
    # # Groups:   group [5]
    # group   p_value
    # <chr>     <dbl>
    # 1 A       0.388
    # 2 B       0.812
    # 3 C       0.546
    # 4 D       0.277
    # 5 E       0.622
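
    Another option, sketched here under the assumption that the before/after rows of each group are already in matching order, is to stay inside summarise() and pass the two samples to wilcox.test() as vectors. Inside summarise(), bare column names such as value and timepoint refer to the current group only, so no placeholder is needed. This also avoids combining a formula with paired, which newer R releases (4.4.0 and later, as far as I know) reject.

    ```r
    library(dplyr)

    # Same simulated data as in the question, minus the unused id column
    set.seed(123)
    data <- data.frame(
      timepoint = rep(c("before", "after"), times = 100),
      group = rep(c("A", "B", "C", "D", "E"), each = 40),
      value = rnorm(200)
    )

    # Assumption: within each group, the i-th "before" row and the i-th
    # "after" row belong to the same subject (pairing by row order).
    result <- data %>%
      group_by(group) %>%
      summarise(
        p_value = wilcox.test(
          value[timepoint == "before"],
          value[timepoint == "after"],
          paired = TRUE
        )$p.value
      )
    print(result)
    ```

    If the rows are not guaranteed to be in matching order, sort by the subject identifier and time point first so that the pairs line up.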