Search code examples
rstatisticsgenome

R hypergeometric test between a character vector and a list, calculating p values in a loop


I'm trying to write a code myself to run the hypergeometric test in R using phyper.

I have a character vector of upregulated genes: (or these are "red" balls I pulled out from my urn)

gene.up <- c("A", "B", "C", "D")

I also have a character vector of all genes found in my experiment: (or these are all balls - "white" and "red" - that I pulled out from my urn)

gene.background <- c("A", "B", "C", "D", "E", "F")

I also have a list of characters containing pathway information: (or each of the "pathway" is a subset of balls I pulled out from my urn, and in this case, my urn has 5 white and 4 red balls)

gene.pathway.list <- list("pathwayA" = c("A", "F", "G"),
                          "pathwayB" = c("A", "B", "E", "H"),
                          "pathwayC" = c("D", "G", "I"))

Now I need to run a hypergeometric test for each pathway in my gene.pathway.list. So I created an empty data frame to store the pathway names and p values from the hypergeometric test, and created a loop of tests like below.

df <- data.frame(pathway=character(length(gene.pathway.list)), pvalue=numeric(length(gene.pathway.list)))

for (i in c(1:length(gene.pathway.list))) {
  df[i,1] <- names(gene.pathway.list[i])
  df[i,2] <- phyper(sum(gene.pathway.list[[i]] == gene.up), length(gene.pathway.list[[i]]), 
length(unique(unlist(gene.pathway.list))) - length(gene.pathway.list[[i]]),  
length(gene.background))
}

However, the output values don't make any sense - for example, my p value for pathway C is zero, but how can the possibility of pulling out "C" and "D" be zero? I'm trying to figure out what went wrong, what did I set up incorrectly?


Solution

  • We could use %in% instead of ==

    for (i in c(1:length(gene.pathway.list))) {
      df[i,1] <- names(gene.pathway.list[i])
      df[i,2] <- phyper(sum(gene.pathway.list[[i]] %in% gene.up), length(gene.pathway.list[[i]]), 
    length(unique(unlist(gene.pathway.list))) - length(gene.pathway.list[[i]]),  
    length(gene.background))
    }
    

    -output

    > df
       pathway    pvalue
    1 pathwayA 0.1071429
    2 pathwayB 0.2142857
    3 pathwayC 0.1071429
    

    == is elementwise comparison operator. The lengths of the lhs and rhs elements were not the same, thus the shorter length recycles and creates the anomaly. Instead, use %in%