Search code examples
rtidyverseaggregatepairwise

Aggregate sum of column values for all pairwise groupings of other columns in a dataframe in R


I have been trying to aggregate the sum of one column in a dataframe for all pairwise comparisons of the other columns in a data frame. The dataset I have is very large, but below is a dummy set that illustrates the issue I am having. I want to be able to do this so I can not repeat a ton of code to individually get these pairwise summations.

library(tidyverse)
library(broom)

data <- data.frame(team= c('A','B','C','A','B', 'A'),
       height= c('tall', 'short', 'tall','short','tall','tall'),
       size= c('big','small','big','big','small','small'),
       cost= c(5,5,4,4,5,4))

lapply(1:ncol(data), function(i) aggregate(data$cost~., data[c(1, i)], sum)) 

#This gives the results below grouping just first column (team) against
#the others and getting a sum :

[[1]]
  team team.1 data$cost
1     A       A        13
2     B       B        10
3     C       C         4

[[2]]
  team height data$cost
1     A  short         4
2     B  short         5
3     A   tall         9
4     B   tall         5
5     C   tall         4

[[3]]
  team  size data$cost
1     A   big         9
2     C   big         4
3     A small         4
4     B small        10

[[4]]
  team data$cost
1     A        13
2     B        10
3     C         4

What I want to avoid is having to manually replace the column number in the aggregate function, indicated by data[c(1, i)] to get the next set of pairwise groupings. Again the actual data frame is much much larger and this would be tedious.

I tried the following code and attempted to create a list of lists that I could unnest:

all_comparisons <- lapply(1:ncol(data), function(i) aggregate(data$cost~., 
                                                       data[c(c(1:i), i)], sum))

huge_list_all_comparisons <- all_comparisons %>% bind_rows(all_comparisons) %>%    # make larger sample data
  mutate_if(is.list, simplify_all) %>%    # flatten each list element internally 
  unnest()  

>huge_list_all_comparisons
 A tibble: 40 × 8
   team team.1 `data$cost` height height.1 size  size.1 cost.1
   <chr> <chr>         <dbl> <chr>  <chr>    <chr> <chr>   <dbl>
 1 A     A                13 NA     NA       NA    NA         NA
 2 B     B                10 NA     NA       NA    NA         NA
 3 C     C                 4 NA     NA       NA    NA         NA
 4 A     NA                4 short  short    NA    NA         NA
 5 B     NA                5 short  short    NA    NA         NA
 6 A     NA                9 tall   tall     NA    NA         NA
 7 B     NA                5 tall   tall     NA    NA         NA
 8 C     NA                4 tall   tall     NA    NA         NA
 9 A     NA                4 short  NA       big   big        NA
10 A     NA                5 tall   NA       big   big        NA
# … with 30 more rows

which returns cost sums of every single possible grouping, not just pairwise (in the real data set this would be prohibitive and lead to over a million rows of comparisons)

I would greatly appreciate any help getting some code that I could use to accomplish this pairwise group aggregation across the dataframe


Solution

  • You can use combn() to get the possible combinations of indices and then lapply() over that.

    library(tidyverse)
    
    data |> 
      seq_along() |> 
      combn(2, simplify = F) |> 
      lapply(\(i) aggregate(data$cost~., data[c(i[1], i[2])], sum)) 
    #> [[1]]
    #>   team height data$cost
    #> 1    A  short         4
    #> 2    B  short         5
    #> 3    A   tall         9
    #> 4    B   tall         5
    #> 5    C   tall         4
    #> 
    #> [[2]]
    #>   team  size data$cost
    #> 1    A   big         9
    #> 2    C   big         4
    #> 3    A small         4
    #> 4    B small        10
    #> 
    #> [[3]]
    #>   team data$cost
    #> 1    A        13
    #> 2    B        10
    #> 3    C         4
    #> 
    #> [[4]]
    #>   height  size data$cost
    #> 1  short   big         4
    #> 2   tall   big         9
    #> 3  short small         5
    #> 4   tall small         9
    #> 
    #> [[5]]
    #>   height data$cost
    #> 1  short         9
    #> 2   tall        18
    #> 
    #> [[6]]
    #>    size data$cost
    #> 1   big        13
    #> 2 small        14
    

    Created on 2022-03-30 by the reprex package (v2.0.1)