Search code examples
rdplyrfdr

Using a temporal inner variable in dplyr outside of the group


I need to calculate an FDR variable per group, using an expected random distribution of p values (corresponds to the "Random" type).

library(dplyr)
library(data.table)
calculate_empirical_fdr = function(control_pVal, test_pVal) {
  m_control = length(control_pVal)
  m_test = length(test_pVal)
  unlist(lapply(test_pVal, function(significance_threshold) {
    m_control = length(control_pVal)
    m_test = length(test_pVal)

    FP_expected = length(control_pVal[control_pVal<=significance_threshold])*m_test/m_control # number of 
expected false positives in a p-value sequence with the size m_test
    S = length(test_pVal[test_pVal<=significance_threshold]) # number of significant hits (FP + TP)
    return(FP_expected/S)
  }))
}

An example dataset with groups I need to control for in the "Group" variable:

set.seed(42)
library(dplyr); library(data.table)
dataset_test = data.table(Type = c(rep("Random", 500),
                                   rep("test1", 500),
                                   rep("test2", 500)),
                          Group = sample(c("group1", "group2", "group3"), 1500, replace = T),
                          Pvalue = c(runif(n = 500),
                                     rbeta(n = 500, shape1 = 1, shape2 = 4),
                                     rbeta(n = 500, shape1 = 1, shape2 = 6))
                          )

Data visualization: enter image description here

I have found that the best way to use my function per group would be using a temporal variable where I can store the p values of the random type, but this does not work:

dataset_test %>%
  group_by(Group) %>%
  {filter(Type=="Random") %>% select(Pvalue) ->> control_set } %>%
  group_by(Type, add = T) %>%
  mutate(FDR_empirical = calculate_empirical_fdr(control_pVal = control_set,
                              test_pVal = Pvalue)) %>%
  data.table()

Error in filter(Type == "Random") : object 'Type' not found

I understand that probably temporal vairables "do not see" the environment within the data.table, would be glad to hear any suggestions how to fix it.


Solution

  • You can do something like this, which filters the control group P-values using the data.table special .BY

    setDT(dataset_test)
    dataset_test[
      i= Type!="Random",
      j = FDR_empirical:=calculate_empirical_fdr(dataset_test[Type=="Random" &  Group ==.BY$Group, Pvalue], Pvalue),
      by = .(Group, Type)
    ]
    

    Output:

            Type  Group     Pvalue FDR_empirical
       1: Random group1 0.70292111            NA
       2: Random group1 0.72383117            NA
       3: Random group1 0.76413459            NA
       4: Random group1 0.87942702            NA
       5: Random group2 0.71229213            NA
      ---                                       
    1496:  test2 group3 0.34817178     0.3681791
    1497:  test2 group1 0.22419118     0.2308988
    1498:  test2 group3 0.07258545     0.2314655
    1499:  test2 group2 0.24687976     0.2849462
    1500:  test2 group1 0.12206777     0.1760657