How does wilcox_test understand which column specifies the individual sample in a paired test?

How do I know that wilcox_test (package rstatix) recognizes the correct column for each individual sample, when doing a paired test?

Here is an example:

install.packages("rstatix")
install.packages("datarium")
library(rstatix)
library(datarium)

data("mice2", package = "datarium")
mice2.long <- mice2 %>% gather(key = "group", value = "weight", before, after)

mice2.long %>% wilcox_test(weight ~ group, paired = TRUE)

The test seems to work correctly, but I never specified that the column "id" identifies each individual sample. How did the test know which observations form a pair?

Solution

Because the argument passed to paired is TRUE, the function pairs observations by position: the first value of before with the first value of after, the second with the second, and so on. It does not use the id column at all. This means that if the data are not ordered so that the n-th value of before belongs to the same individual as the n-th value of after, wilcox_test will give incorrect results.

Here is a quick example:

library(dplyr)   # %>% and slice_sample()
library(tidyr)   # gather()

set.seed(0)
dat <- data.frame(id = 1:10, matrix(rnorm(20, 15, 2), , 2)) |>
  setNames(c('id', 'before', 'after'))
 
dat %>%
  gather(key = "group", value = "weight", before, after) %>%
  rstatix::wilcox_test(weight~group, paired  = TRUE)
# A tibble: 1 × 7
  .y.    group1 group2    n1    n2 statistic     p
* <chr>  <chr>  <chr>  <int> <int>     <dbl> <dbl>
1 weight after  before    10    10        14 0.193

Now if we shuffle the rows of the long data, so that the first value of before no longer corresponds to the first value of after, we get a different result:

set.seed(1)
dat %>%
   gather(key = "group", value = "weight", before, after) %>%
   slice_sample(n = 20)%>%
   rstatix::wilcox_test(weight~group, paired  = TRUE)
# A tibble: 1 × 7
  .y.    group1 group2    n1    n2 statistic     p
* <chr>  <chr>  <chr>  <int> <int>     <dbl> <dbl>
1 weight after  before    10    10        15 0.232

Try again with a different seed and you get different results.
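
Since the pairing is positional, one way to guard against a shuffled data frame is to sort the long data by group and id before running the test. The following is a minimal sketch of that idea, not something rstatix requires; it assumes dplyr, tidyr and rstatix are installed, and the names long, shuffled, restored, ref and fix are only illustrative.

```r
library(dplyr)
library(tidyr)
library(rstatix)

set.seed(0)
dat <- data.frame(id = 1:10,
                  before = rnorm(10, 15, 2),
                  after  = rnorm(10, 15, 2))

# Long format; within each group the rows are still in id order
long <- dat %>% gather(key = "group", value = "weight", before, after)

# Shuffle all rows, then restore a consistent order within each group
shuffled <- long %>% slice_sample(prop = 1)
restored <- shuffled %>% arrange(group, id)

ref <- long     %>% wilcox_test(weight ~ group, paired = TRUE)
fix <- restored %>% wilcox_test(weight ~ group, paired = TRUE)

# Sorting by id puts the pairs back together, so the statistics agree
all.equal(as.numeric(ref$statistic), as.numeric(fix$statistic))
```

Sorting by id within each group makes the n-th before and the n-th after belong to the same individual again, which is all the positional pairing needs.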

Note that your case behaves differently: shuffling mice2 does not change the test statistic. Why? Because every value of before is smaller than every value of after, i.e. the maximum of before is smaller than the minimum of after:

mice2$before|>max()
[1] 235
mice2$after|>min()
[1] 337

This matters for the Wilcoxon signed-rank statistic: whichever way the values are paired, every difference after - before is positive, so all ten ranks are counted as positive and the statistic is simply sum(1:10) = 55. This is the test statistic.


mice2 %>% 
  gather(key = "group", value = "weight", before, after)%>%
  rstatix::wilcox_test(weight~group, paired  = TRUE)

# A tibble: 1 × 7
  .y.    group1 group2    n1    n2 statistic       p
1 weight after  before    10    10         55 0.00195
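
The rank argument above can be checked with base R's wilcox.test, which rstatix wraps. This is only a sketch on made-up numbers (not the mice2 data), chosen so that every after value exceeds every before value; the vector names and the 100 random re-pairings are assumptions for illustration.

```r
# Hypothetical data (not mice2): every 'after' exceeds every 'before'
before <- 180 + 1:10
after  <- 340 + (1:10)^2
stopifnot(max(before) < min(after))

# Correct pairing: V is the sum of the ranks of the positive differences
res <- wilcox.test(after, before, paired = TRUE)
unname(res$statistic)   # 55, i.e. sum(1:10)
res$p.value             # 2 / 2^10 = 0.001953125 (exact two-sided p-value)

# Re-pair at random: all differences stay positive, so V cannot change.
# suppressWarnings() hides the tie warning some re-pairings may trigger.
set.seed(1)
v_shuffled <- replicate(100, {
  s <- suppressWarnings(wilcox.test(sample(after), before, paired = TRUE))
  unname(s$statistic)
})
all(v_shuffled == sum(1:10))
```

The p-value of the correctly paired test matches the 0.00195 reported above because, with ten untied positive differences, only one of the 2^10 sign patterns is as extreme in each tail.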