
How to apply Fisher's exact test to each column of count data split into two categories?


I have count data (columns) recording the presence/absence (1/0) of various genes in different samples, each of which belongs to one of two categories. I am trying to run Fisher's exact test (fisher.test) for each gene and write the results to a separate table. Because I have over 100 genes (with names more complicated than Gene1, Gene2, ...), I would like to apply the test to every column containing gene data, skipping the first few columns of categorizing metadata, and collect the results for all genes in one table, with each row labelled by the name of the column it was calculated from.

The answers I have found have tabulated and grouped counts rather than columns of raw ungrouped count data, so the layout is quite different.

I have tried a few different things, like a for loop, apply() with a function in it, etc., but because the column name is part of the calculation formulas, and because I need to skip the first 8 columns of metadata, it's become quite complex and I can't get it to work properly.

Sample data:

mydata <- data.frame(sampleID = c("A", "B", "C", "D", "E", "F", "G"),
                     category = c("high", "low", "high", "high", "low", "high", "low"),
                     Gene1 = c(1, 1, 0, 0, 0, 1, 1),
                     Gene2 = c(0, 1, 1, 1, 1, 1, 0),
                     Gene3 = c(0, 0, 0, 1, 1, 1, 1))

The simple form of the test, applied to just one column, and data output into console rather than saved, worked:

matrix = with(mydata, table(category, Gene1))  
fisher.test(with(mydata, matrix))

I tried something like:

for(x in 3:ncol(mydata)) {
  matrix = with(mydata, table(category, x))
  fisher.test(with(mydata, matrix))
}

I tried using apply:

df = mydata
output.file <- fxn = function(x){
  matrix = with(df, table(Site, x))
  fisher.test(with(df, matrix))
}

apply(x = df,
      FUN = fxn,
      MARGIN = 2)

I know it's wrong, but I don't know how to make it right. PS. I'm pretty new to R, so if you can explain how your code works, it will help me to replicate it again.


Solution

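  • Update to handle genes with no variability (all 0s or all 1s). For such a gene, table(category, value) has only one column and fisher.test() raises an error, so these genes are filtered out before testing. Try this: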

    library(dplyr)
    library(tidyr)
    library(broom)

    mydata %>%
      select(-sampleID) %>%
      pivot_longer(cols = -category, names_to = "gene") %>%
      group_by(gene) %>%
      # keep only genes with both 0s and 1s; a gene that is all 0s or all 1s
      # gives a one-column table, which fisher.test() cannot handle
      filter(sum(value) > 0 & sum(value) < n()) %>%
      summarise(fisher_test = list(tidy(fisher.test(table(category, value))))) %>%
      unnest(fisher_test) %>%
      # fisher.test() already reports the (conditional MLE) odds ratio as its estimate
      rename(odds_ratio = estimate) %>%
      select(-method, -alternative)
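
    To see why constant genes need special handling, here is a small base R sketch. Gene4 is a hypothetical gene, not part of the sample data, that is present in every sample; the check at the end mirrors the condition used in the filter() step.

    ```r
    # category vector from the sample data
    category <- c("high", "low", "high", "high", "low", "high", "low")
    # Hypothetical gene (an assumption for illustration): present in every sample
    Gene4 <- c(1, 1, 1, 1, 1, 1, 1)

    tab <- table(category, Gene4)
    dim(tab)  # two rows (high, low) but only one column

    # fisher.test() needs at least 2 rows and 2 columns, so the call fails;
    # tryCatch() turns the error into NULL so the script keeps running
    res <- tryCatch(fisher.test(tab), error = function(e) NULL)
    is.null(res)

    # Same condition as in the filter() step: does the gene have both 0s and 1s?
    has_variability <- sum(Gene4) > 0 & sum(Gene4) < length(Gene4)
    has_variability
    ```

    Dropping such genes is one option; returning a row of NA values for them would be another, if you would rather keep every gene in the output table.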
    

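    First answer: here is a tidyverse + broom combination. It reshapes the gene columns into long format (one row per sample and gene), runs fisher.test() once per gene, and uses tidy() from broom to turn each test result into a one-row table: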

    library(dplyr)
    library(tidyr)
    library(broom)
    
    mydata %>%
      select(-sampleID) %>%
      pivot_longer(cols = -category, names_to = "gene") %>%
      group_by(gene) %>%
      summarise(fisher_test = list(tidy(fisher.test(table(category, value))))) %>%
      unnest(fisher_test) %>%
      # fisher.test() already reports the odds ratio as its estimate, so no exp() is needed
      rename(odds_ratio = estimate) %>%
      select(-method, -alternative)
    
      gene  odds_ratio p.value conf.low conf.high
      <chr>      <dbl>   <dbl>    <dbl>     <dbl>
    1 Gene1      1.81        1  0.0469      176. 
    2 Gene2      0.707       1  0.00640      78.2
    3 Gene3      1.81        1  0.0469      176. 
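
    Since the question started from a for loop and apply(), here is a base R sketch of the same per-column idea without extra packages. It assumes the metadata sits in the first n_meta columns (2 in the sample data, 8 in the real data; n_meta is a name introduced here for illustration). The key difference from the loop attempt is mydata[[g]], which fetches a column by the name stored in g, whereas table(category, x) only saw the number x. The sketch does not guard against genes with no variability.

    ```r
    # Sample data from the question
    mydata <- data.frame(
      sampleID = c("A", "B", "C", "D", "E", "F", "G"),
      category = c("high", "low", "high", "high", "low", "high", "low"),
      Gene1 = c(1, 1, 0, 0, 0, 1, 1),
      Gene2 = c(0, 1, 1, 1, 1, 1, 0),
      Gene3 = c(0, 0, 0, 1, 1, 1, 1)
    )

    # Assumption: the first n_meta columns are metadata, the rest are genes
    n_meta <- 2
    gene_cols <- names(mydata)[-seq_len(n_meta)]

    # Run the test once per gene column and keep the pieces of interest
    results <- lapply(gene_cols, function(g) {
      ft <- fisher.test(table(mydata$category, mydata[[g]]))
      data.frame(
        gene       = g,                    # label the row with the column name
        odds_ratio = unname(ft$estimate),  # conditional MLE of the odds ratio
        p.value    = ft$p.value,
        conf.low   = ft$conf.int[1],
        conf.high  = ft$conf.int[2]
      )
    })

    # Stack the one-row data frames into a single results table
    fisher_results <- do.call(rbind, results)
    fisher_results
    ```

    The result is an ordinary data frame, so it can be saved with write.csv(fisher_results, "fisher_results.csv", row.names = FALSE), where the file name is only an example.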