Search code examples
rdata.table

R data.table summary not consistent between explicit sum and .SD


in R data table summarizing producing inconsistent results

DT = data.table(
  A = rep(1:3, each = 5L), 
  B = rep(1:5, 3L),
  C = sample(15L),
  D = sample(15L)
)
DT[, .(suma = sum(A), sumb = sum(B), sumc=sum(C), sumd= sum(D)), by=A]

Consciously summarizing a grouping variable is added. However this produces

# A data.table: 3 × 5 
A   suma    sumb    sumc    sumd
<int>   <int>   <int>   <int>   <int>
1   1   15  40  36
2   2   15  39  38
3   3   15  41  46

which is not correct. However modifying this like

DT[, lapply(.SD, sum), by=A, .SDcols=c('A' ,'B','C','D')]

give correct results

A data.table: 3 × 5 
A   A   B   C   D
<int>   <int>   <int>   <int>   <int>
1   5   15  40  36
2   10  15  39  38
3   15  15  41  46

Is this inconsistency expected or any reason why they are giving different results?


Solution

  • I can reproduce this with data.table 1.15.41 and I agree the output is far from intuitive.

    What seems to be happening is this: when you perform an operation like DT[, .(suma = sum(A)), by = A], data.table treats A in the j expression as the grouping key rather than as a vector of values in the A column. This means sum(A) operates on the group identifier - a scalar - not the actual data within each group.

    We can see this if we try to take the length() of the values.

    DT[, .(len_a = length(A), len_b = length(B)), A]
    #        A len_a len_b
    #    <int> <int> <int>
    # 1:     1     1     5
    # 2:     2     1     5
    # 3:     3     1     5
    

    Contrast this with using .SD:

    DT[, lapply(.SD, length), A, .SDcols = c("A", "B")]
    #        A     A     B
    #    <int> <int> <int>
    # 1:     1     5     5
    # 2:     2     5     5
    # 3:     3     5     5
    

    Verbose output provides a little more insight:

    DT[, .(len_a = length(A), len_b = length(B)), A, verbose = TRUE]
    # Detected that j uses these columns: [B]
    # Finding groups using forderv ... forder.c received 15 rows and 1 columns
    # 0.000s elapsed (0.000s cpu)
    # Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
    # lapply optimization is on, j unchanged as 'list(length(A), length(B))'
    
    #        A len_a len_b
    #    <int> <int> <int>
    # 1:     1     1     5
    # 2:     2     1     5
    # 3:     3     1     5
    

    The output shows that data.table skips using the A column as a vector:

    Detected that j uses these columns: [B]

    Yet it also claims:

    j unchanged as 'list(length(A), length(B))'

    This is a little misleading as the A here is the group key instead of the A column.

    I skimmed the Aggregations section of the data.table docs and I couldn't find anything alluding to this behaviour. However, @Gusbourne has pointed out in a comment that this is mentioned in the data.table FAQ 2.10, Inside each group, why are the group variables length-1? The answer is: for efficiency and convenience.

    I think it's quite unusual to perform numeric operations on a grouping column. Nevertheless, the output is not what I would expect and to me this seems like a bug to me - at least in the prominence of this in documentation, if not in implementation. If one did not already exist, it might be worth submitting a GitHub issue. However, @Gusbourne has drawn my attention to a related issue from 2018. This seems to be considered a feature and the issue has been closed so I think the best thing to do is just avoid the undesired behaviour using the method in your question, by using .SD.