I have two very large df: df 1 and df2. Df 1 contains the columns "from", "to" and "count". The values in "from" and "to" represent commuting points and can occur multiple times:
from1 | to1 | count |
---|---|---|
10020 | 10020 | 20 |
10020 | 10020 | 10 |
10020 | 22001 | NA |
30030 | 20020 | 2 |
45001 | 32001 | 100 |
45001 | 32001 | NA |
45001 | 45001 | 1 |
90080 | 45002 | NA |
In df 2, I would like to create each possible combination for "from" and "to". Then, I would like to fill in a new column "count_total" the sum of commuters for each individual pair. If the combination does not occur in df 1, I would like to fill in 0. For NA, I would like to fill in 0. My desired output:
from2 | to2 | count_total |
---|---|---|
10020 | 10020 | 30 |
10020 | 22001 | 0 |
10020 | 20020 | 0 |
10020 | 32001 | 0 |
10020 | 45001 | 0 |
10020 | 45002 | 0 |
30030 | 10020 | 0 |
30030 | 22001 | 0 |
30030 | 20020 | 2 |
...
I tried the following, however, it did not sum up the values for "count_total" correctly.
df2 <- CJ(from2 = unique(df1$from1),
to2 = unique(df1$to1))
df2[, count_total := sum(df1$count[
df1$from1 == from2 &
df1$to1 == to2
]), by = .(from2, to2)]
What did I do wrong? Thanks!
We can do a merge-then-summarize:
library(data.table)
setDT(df1)
CJ(from2 = unique(df1$from1), to2 = unique(df1$to1)
)[df1, count2 := i.count, on = .(from2==from1, to2==to1)
][, .(count2 = sum(c(0, count2), na.rm = TRUE)), by = .(from2, to2)]
# from2 to2 count2
# <int> <int> <num>
# 1: 10020 10020 10
# 2: 10020 20020 0
# 3: 10020 22001 0
# 4: 10020 32001 0
# 5: 10020 45001 0
# 6: 10020 45002 0
# 7: 30030 10020 0
# 8: 30030 20020 2
# 9: 30030 22001 0
# 10: 30030 32001 0
# ---
# 15: 45001 22001 0
# 16: 45001 32001 0
# 17: 45001 45001 1
# 18: 45001 45002 0
# 19: 90080 10020 0
# 20: 90080 20020 0
# 21: 90080 22001 0
# 22: 90080 32001 0
# 23: 90080 45001 0
# 24: 90080 45002 0