Search code examples
rdplyrtidyrp-valuekruskal-wallis

Kruskal-Wallis statistic & p-value matrix loop for data subsets in R


Consider a dataset "trees" which has several factors and two numerical continuous variables. Two of these variables, plot.type (with classes "one", "ten", "non") and demog (with classes "seed", "sap", "adult"), are used to subset the data. For every subset, Kruskal-Wallis tests should be run on the two continuous variables: count & prop, using the grouping variable guild (with classes "fast", "med", "slow").

Without individually writing out the test on each subset and subset-of-subset of the data, is there a quick way in R to accomplish this task and put calculated p values to a matrix?

Example data

library(dplyr)
set.seed(123)
Data <- tbl_df(
  data.frame(
    plot.type = as.factor(rep(c("one", "ten", "non"), each = 80)),
    guild.type = as.factor(sample(c("fast", "med", "slow"), replace = TRUE, prob = c(.12, .28, .60))),
    demog = as.factor(sample(c("seed", "sap", "adult"),, prob = c(.33, .33, .33))),
    prop   = runif(240, 0, 10),
    count    = runif(240, 0, 10))
  )%>%
  group_by(plot.type)
trees <- as.data.frame(Data)

Was looking at other similar questions:

  1. R: Kruskal-Wallis test in loop over specified columns in data frame
  2. Kruskal - Wallis p-value matrix for data subsets with R

Tried this:

library(dplyr)
library(tidyr)
nm1 <- names(lhs)[4:5]
f1 <- function(x,y) kruskal.test(x~y)$p.value

tree %>% 
  do({data.frame(Map(f1, .[nm1], list(.$compare_by)))}) %>% 
  unite(Data_subsets, demog, plot.type, sep="_")

But missing how to include the guild infos as the grouping variable for the KW tests. Perhaps I should try to transform guild to a wide format table here?

Example of desired output:

Data_subsets fast.stat  fast.p  med.stat  med.p  slow.stat slow.p          
1   seed_one   <value>  <value> <value>  <value>  <value> <value>
2   seed_ten   <value>  <value> <value>  <value>  <value> <value>
3   seed_non   <value>  <value> <value>  <value>  <value> <value>
4   sap_one    <value>  <value> <value>  <value>  <value> <value>
5   sap_ten    <value>  <value> <value>  <value>  <value> <value>
6   sap_non    <value>  <value> <value>  <value>  <value> <value>
7   adult_one   <value>  <value> <value>  <value>  <value> <value>
8   adult_ten   <value>  <value> <value>  <value>  <value> <value>
9   adult_non   <value>  <value> <value>  <value>  <value> <value>

Any suggestions? Thanks in advance!


Solution

  • If I understand you correctly, you can just unite all three factors together, run the Kruskal-Wallis test on all the resulting group, then spread your table wide to have the fast, med, slow separated by columns.

    
    library(dplyr)
    set.seed(12)
    Data <- tbl_df(
      data.frame(
        plot.type = as.factor(rep(c("one", "ten", "non"), each = 80)),
        guild.type = as.factor(sample(c("fast", "med", "slow"), replace = TRUE, prob = c(.12, .28, .60))),
        demog = as.factor(sample(c("seed", "sap", "adult"),, prob = c(.33, .33, .33))),
        prop   = runif(240, 0, 10),
        count    = runif(240, 0, 10))
      )%>%
      group_by(plot.type)
    trees <- as.data.frame(Data)
    
    
    library(dplyr)
    library(tidyr)
    
    trees %>% 
      unite(Data_subsets, demog, plot.type, guild.type, sep="_", remove = F) %>% 
      group_by(Data_subsets) %>% 
      summarise(stats=kruskal.test(count~prop)$stat, p = kruskal.test(count~prop)$p.value) %>% 
      separate(Data_subsets, c("demog", "plot.type", "guild.type")) %>% 
      unite(Subsets, demog, plot.type, sep="_") %>% 
      pivot_wider(id_cols = Subsets, 
                names_from = guild.type, 
                values_from = c("stats", "p"))
    

    This would result in something like this:

     A tibble: 9 x 7
      Subsets   stats_med stats_slow stats_fast  p_med p_slow p_fast
      <chr>         <dbl>      <dbl>      <dbl>  <dbl>  <dbl>  <dbl>
    1 adult_non        26         NA         NA  0.463 NA     NA    
    2 adult_one        26         NA         NA  0.463 NA     NA    
    3 adult_ten        25         NA         NA  0.462 NA     NA    
    4 sap_non          NA         25         NA NA      0.462 NA    
    5 sap_one          NA         26         NA NA      0.463 NA    
    6 sap_ten          NA         26         NA NA      0.463 NA    
    7 seed_non         NA         NA         26 NA     NA      0.463
    8 seed_one         NA         NA         25 NA     NA      0.462
    9 seed_ten         NA         NA         26 NA     NA      0.463
    

    The NAs are just there because the data does not have that comparison as is. I would choose to keep the table without pivoting_wider. Something like this:

    trees %>% 
      unite(Data_subsets, demog, plot.type, guild.type, sep="_", remove = F) %>% 
      group_by(Data_subsets) %>% 
      summarise(stats=kruskal.test(count~prop)$stat, p = kruskal.test(count~prop)$p.value) %>% 
      separate(Data_subsets, c("demog", "plot.type", "guild.type")) 
    

    results:

    # A tibble: 9 x 5
      demog plot.type guild.type stats        p
      <chr> <chr>     <chr>      <dbl>    <dbl>
    1 adult non       med          266 4.38e-22
    2 adult one       med          266 4.38e-22
    3 adult ten       med          265 6.25e-22
    4 sap   non       slow         265 6.25e-22
    5 sap   one       slow         266 4.38e-22
    6 sap   ten       slow         266 4.38e-22
    7 seed  non       fast         266 4.38e-22
    8 seed  one       fast         265 6.25e-22
    9 seed  ten       fast         266 4.38e-22
    

    I hope this helps.

    H