in R data table summarizing producing inconsistent results
DT = data.table(
A = rep(1:3, each = 5L),
B = rep(1:5, 3L),
C = sample(15L),
D = sample(15L)
)
DT[, .(suma = sum(A), sumb = sum(B), sumc=sum(C), sumd= sum(D)), by=A]
Consciously summarizing a grouping variable is added. However this produces
# A data.table: 3 × 5
A suma sumb sumc sumd
<int> <int> <int> <int> <int>
1 1 15 40 36
2 2 15 39 38
3 3 15 41 46
which is not correct. However modifying this like
DT[, lapply(.SD, sum), by=A, .SDcols=c('A' ,'B','C','D')]
give correct results
A data.table: 3 × 5
A A B C D
<int> <int> <int> <int> <int>
1 5 15 40 36
2 10 15 39 38
3 15 15 41 46
Is this inconsistency expected or any reason why they are giving different results?
I can reproduce this with data.table 1.15.41
and I agree the output is far from intuitive.
What seems to be happening is this: when you perform an operation like DT[, .(suma = sum(A)), by = A]
, data.table
treats A
in the j
expression as the grouping key rather than as a vector of values in the A
column. This means sum(A)
operates on the group identifier - a scalar - not the actual data within each group.
We can see this if we try to take the length()
of the values.
DT[, .(len_a = length(A), len_b = length(B)), A]
# A len_a len_b
# <int> <int> <int>
# 1: 1 1 5
# 2: 2 1 5
# 3: 3 1 5
Contrast this with using .SD
:
DT[, lapply(.SD, length), A, .SDcols = c("A", "B")]
# A A B
# <int> <int> <int>
# 1: 1 5 5
# 2: 2 5 5
# 3: 3 5 5
Verbose output provides a little more insight:
DT[, .(len_a = length(A), len_b = length(B)), A, verbose = TRUE]
# Detected that j uses these columns: [B]
# Finding groups using forderv ... forder.c received 15 rows and 1 columns
# 0.000s elapsed (0.000s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
# lapply optimization is on, j unchanged as 'list(length(A), length(B))'
# A len_a len_b
# <int> <int> <int>
# 1: 1 1 5
# 2: 2 1 5
# 3: 3 1 5
The output shows that data.table
skips using the A
column as a vector:
Detected that j uses these columns: [B]
Yet it also claims:
j unchanged as 'list(length(A), length(B))'
This is a little misleading as the A
here is the group key instead of the A
column.
I skimmed the Aggregations section of the data.table
docs and I couldn't find anything alluding to this behaviour. However, @Gusbourne has pointed out in a comment that this is mentioned in the data.table
FAQ 2.10, Inside each group, why are the group variables length-1? The answer is: for efficiency and convenience.
I think it's quite unusual to perform numeric operations on a grouping column. Nevertheless, the output is not what I would expect and to me this seems like a bug to me - at least in the prominence of this in documentation, if not in implementation. If one did not already exist, it might be worth submitting a GitHub issue. However, @Gusbourne has drawn my attention to a related issue from 2018. This seems to be considered a feature and the issue has been closed so I think the best thing to do is just avoid the undesired behaviour using the method in your question, by using .SD
.