Tags: r, comparison, pairwise, long-running-task

A more efficient way of doing grouped pairwise comparisons in a large dataset?


I have data that looks like this:

Tab4 <- read.table(text = "
  nodepair  `++`  `--`  `+-`  `-+`  `0+`  `+0`  `0-`  `-0`  `00` ES   
1 A1_A1        0     4     0     0     0     0     0     0    16 3    
2 A1_A1        0     5     0     0     0     0     0     0    16 4    
3 A1_A1        0     5     0     0     0     0     0     0    15 5    
", header = TRUE)

I've written this code to compare every pair of 'ES' groups within each nodepair:

ES_combs <- combn(unique(Tab4$ES), 2, simplify = FALSE)

Tab5 <- Tab4 %>%                            # compare every pair of ES groups to each other
  group_split(nodepair) %>% 
  map(.f = function(df) df %>%
        map(.x = 1:length(ES_combs),
            .f = ~df %>% 
              filter(ES %in% ES_combs[[.x]]) %>% 
              summarize(nodepair = first(nodepair),
                        ES_1 = ES[1],
                        ES_2 = ES[2], 
                        across(2:10, ~as.numeric(.))))) %>%
  bind_rows()

resulting in this:

Tab5 <- read.table(text = "
  nodepair ES_1  ES_2   `++`  `--`  `+-`  `-+`  `0+`  `+0`  `0-`  `-0`  `00`
1 A1_A1    3     4         0     4     0     0     0     0     0     0    16
2 A1_A1    3     4         0     5     0     0     0     0     0     0    16
3 A1_A1    3     5         0     4     0     0     0     0     0     0    16
4 A1_A1    3     5         0     5     0     0     0     0     0     0    15
5 A1_A1    4     5         0     5     0     0     0     0     0     0    16
6 A1_A1    4     5         0     5     0     0     0     0     0     0    15    
", header = TRUE)

This works, but it takes far too long on my full dataset. Is there a more efficient way to write it? I suspect that this warning is exposing part of the problem:

Warning messages:
  1: Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an ungrouped
data frame and adjust accordingly.

but I'm not sure where to go from here.


Solution

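  • We can join Tab4 to its own nodepair/ES columns and then drop the rows where an ES value is paired with itself. Each pair shows up in both orders (3/4 and 4/3), and every row carries the counts that belong to ES1:

    # Self-join on nodepair; the two ES columns get the suffixes "1" and "2"
    out <- merge(Tab4, Tab4[, c('nodepair', 'ES')], by = 'nodepair', suffixes = c("1", "2"), all = TRUE)
    # Drop self-pairs (ES1 == ES2)
    out[out$ES1 != out$ES2, ]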
    
      nodepair X.... X.....1 X.....2 X.....3 X.0.. X..0. X.0...1 X..0..1 X.00. ES1 ES2
    2    A1_A1     0       4       0       0     0     0       0       0    16   3   4
    3    A1_A1     0       4       0       0     0     0       0       0    16   3   5
    4    A1_A1     0       5       0       0     0     0       0       0    16   4   3
    6    A1_A1     0       5       0       0     0     0       0       0    16   4   5
    7    A1_A1     0       5       0       0     0     0       0       0    15   5   3
    8    A1_A1     0       5       0       0     0     0       0       0    15   5   4
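
  • To get the layout of Tab5 from the question (one label per unordered pair, smaller ES first), the join above can be post-processed. This is only a sketch under a few assumptions: ES is numeric, "smaller value first" is the ordering you want, and the names ES_1, ES_2 and count_cols are introduced here for illustration. It uses unique() on the lookup side so that several rows per ES value in one nodepair would not multiply the matches.

```r
# Sample data from the question; read.table() turns `++`, `--`, ... into
# syntactic names such as X...., so the count columns are handled by name vector.
Tab4 <- read.table(text = "
  nodepair  `++`  `--`  `+-`  `-+`  `0+`  `+0`  `0-`  `-0`  `00` ES
1 A1_A1        0     4     0     0     0     0     0     0    16 3
2 A1_A1        0     5     0     0     0     0     0     0    16 4
3 A1_A1        0     5     0     0     0     0     0     0    15 5
", header = TRUE)

# Every column except the grouping key and ES is treated as a count column.
count_cols <- setdiff(names(Tab4), c("nodepair", "ES"))

# Self-join: each row is matched with every distinct ES value of its nodepair.
out <- merge(Tab4, unique(Tab4[, c("nodepair", "ES")]),
             by = "nodepair", suffixes = c("1", "2"))

# Drop self-pairs (a row matched with its own ES value).
out <- out[out$ES1 != out$ES2, ]

# Label each row with its unordered pair, smaller ES first (assumes numeric ES).
out$ES_1 <- pmin(out$ES1, out$ES2)
out$ES_2 <- pmax(out$ES1, out$ES2)

# Sort by pair, then by the row's own ES, and put the columns in Tab5 order.
out <- out[order(out$nodepair, out$ES_1, out$ES_2, out$ES1),
           c("nodepair", "ES_1", "ES_2", count_cols)]
rownames(out) <- NULL
out
```

    The whole comparison is one join plus a few vectorised column operations, rather than one filter() per ES pair per nodepair, which is where I would expect most of the time to be saved on a large table. I have not benchmarked it on data of your size.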