data.table: Why does double square bracket subsetting with no comma refer to columns rather than rows (i.e. DT[[3]] == DF[, 3] == DF[[3]])?


Consider the following data.table:

dt <- data.table(a = 1:5, b = 6:10, c = 11:15)
> dt
  a  b  c
1 1  6 11
2 2  7 12
3 3  8 13
4 4  9 14
5 5 10 15

From the Frequently Asked Questions vignette:

DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column.

DT[3, ] == DT[3]

and likewise, from the introduction to data.table vignette, we see that

A comma after the condition in i is not required

In other words, when there is no comma, the index is taken to be i, which selects rows.
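
A minimal sketch of the difference quoted above, assuming the data.table package is installed. Here df is a hypothetical data.frame copy of dt, introduced only for comparison:

```r
library(data.table)

dt <- data.table(a = 1:5, b = 6:10, c = 11:15)
df <- as.data.frame(dt)  # hypothetical data.frame copy, for comparison only

# Single brackets, no comma: data.table treats the index as i (rows) ...
row3 <- dt[3]
dim(row3)   # one row, three columns

# ... while data.frame treats it as a list-style column selection.
col3 <- df[3]
dim(col3)   # five rows, one column

# Adding the trailing comma to the data.table call selects the same row.
isTRUE(all.equal(dt[3], dt[3, ]))
```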

We want to access the b column programmatically, so we assign its name to a variable: colname <- "b". To get the column vector b, we can use any of the following:

> dt[, ..colname][[1]]
[1]  6  7  8  9 10
> dt[,get(colname)]
[1]  6  7  8  9 10
> dt[[colname]]
[1]  6  7  8  9 10

The first two options make sense: they access a column through the j index, so they include a comma (although in a slightly cumbersome manner). The third option, however, accesses a column with no comma at all. I cannot make sense of this from the introductory data.table documentation. What is happening here, and is this behaviour intended?


Solution

  • From the introduction to data.table vignette we see that:

    data.tables (and data.frames) are internally lists as well, with the stipulation that each element has the same length and the list has a class attribute.

    As long as j-expression returns a list, each element of the list will be converted to a column in the resulting data.table.

    Hence the extract operator for lists ([[ ]]) should work the same way on a data.table. Indeed, both a data.table and a list object can have a names attribute.

    We first create a comparable list object

    lst <- list(a = 1:5, b = 6:10, c = 11:15)
    

    And then we can inspect their attributes:

    > attributes(lst)
    $names
    [1] "a" "b" "c"
    
    > attributes(dt)
    $names
    [1] "a" "b" "c"
    
    $row.names
    [1] 1 2 3 4 5
    
    $class
    [1] "data.table" "data.frame"
    
    $.internal.selfref
    <pointer: (nil)>
    

    Using double square brackets [[ ]], we can then access an element of a list by either an integer or a character index. Note also that

    for cases where you need to evaluate an expression to find the index, use x[[expr]]

    That is, we can use any expression that evaluates to an integer or character index. This is what we are doing here: accessing an element of the list through its names attribute (a character index):

    > lst[[colname]]
    [1]  6  7  8  9 10
    > dt[[colname]]
    [1]  6  7  8  9 10
    

    FAQ 1.2 refers to this situation and lists it as a solution, but it does not clarify that it relies on list-style extraction with [[ ]] rather than on anything specific to the data.table object. I feel that this is one of several calling-scope issues that the documentation does not explain, which makes adopting data.table harder at the beginning.
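
As a quick check of the reasoning above, here is a small sketch that recreates dt, lst and colname from the question and compares the results. It assumes the data.table package is installed and is meant as an illustration, not as a description of data.table internals:

```r
library(data.table)

dt  <- data.table(a = 1:5, b = 6:10, c = 11:15)
lst <- list(a = 1:5, b = 6:10, c = 11:15)
colname <- "b"

# A data.table is a list of equal-length columns.
is.list(dt)                                # TRUE

# [[ ]] extracts one list element, by name or by position.
identical(dt[[colname]], lst[[colname]])   # TRUE
identical(dt[[2]], dt[[colname]])          # TRUE

# $ is the same kind of extraction, but only with a literal name.
identical(dt$b, dt[[colname]])             # TRUE

# So dt[[3]] is the third column, whereas dt[3] is the third row.
identical(dt[[3]], 11:15)                  # TRUE
nrow(dt[3])                                # 1
```

In practice this means dt[[colname]] is a reasonable way to pull out a single column by a name stored in a variable, while dt[i, j] remains the place for row filtering and computed columns.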