Search code examples
mergedata.tableuniquekey-valuecross-join

Pair each combination of two columns and calculate sum for a third column in data.table


I have two very large df: df 1 and df2. Df 1 contains the columns "from", "to" and "count". The values in "from" and "to" represent commuting points and can occur multiple times:

from1 to1 count
10020 10020 20
10020 10020 10
10020 22001 NA
30030 20020 2
45001 32001 100
45001 32001 NA
45001 45001 1
90080 45002 NA

In df 2, I would like to create each possible combination for "from" and "to". Then, I would like to fill in a new column "count_total" the sum of commuters for each individual pair. If the combination does not occur in df 1, I would like to fill in 0. For NA, I would like to fill in 0. My desired output:

from2 to2 count_total
10020 10020 30
10020 22001 0
10020 20020 0
10020 32001 0
10020 45001 0
10020 45002 0
30030 10020 0
30030 22001 0
30030 20020 2

...

I tried the following, however, it did not sum up the values for "count_total" correctly.

    df2 <- CJ(from2 = unique(df1$from1), 
                to2 = unique(df1$to1))


    df2[, count_total := sum(df1$count[
             df1$from1 == from2 &
               df1$to1 == to2
                ]), by = .(from2, to2)]

What did I do wrong? Thanks!


Solution

  • We can do a merge-then-summarize:

    library(data.table)
    setDT(df1)
    CJ(from2 = unique(df1$from1), to2 = unique(df1$to1)
      )[df1, count2 := i.count, on = .(from2==from1, to2==to1)
      ][, .(count2 = sum(c(0, count2), na.rm = TRUE)), by = .(from2, to2)]
    #     from2   to2 count2
    #     <int> <int>  <num>
    #  1: 10020 10020     10
    #  2: 10020 20020      0
    #  3: 10020 22001      0
    #  4: 10020 32001      0
    #  5: 10020 45001      0
    #  6: 10020 45002      0
    #  7: 30030 10020      0
    #  8: 30030 20020      2
    #  9: 30030 22001      0
    # 10: 30030 32001      0
    # ---                   
    # 15: 45001 22001      0
    # 16: 45001 32001      0
    # 17: 45001 45001      1
    # 18: 45001 45002      0
    # 19: 90080 10020      0
    # 20: 90080 20020      0
    # 21: 90080 22001      0
    # 22: 90080 32001      0
    # 23: 90080 45001      0
    # 24: 90080 45002      0