Why does R have inconsistent behaviors when a non-existent rowname is retrieved from a data frame?


Why do the two data frames a and b below give different results when I look up a row name that does not exist? For example,

a <- as.data.frame(matrix(1:3, ncol = 1, nrow = 3, dimnames = list(c("A1", "A10", "B"), "V1")))
a
    V1
A1   1
A10  2
B    3

b <- as.data.frame(matrix(4:5, ncol = 1, nrow = 2, dimnames = list(c("A10", "B"), "V1")))
b
    V1
A10  4
B    5

Let's look up "A10", "A1", "A", "B" and "C" in data frame a:

> a["A10", 1]
[1] 2
> a["A1", 1]
[1] 1                    # expected
> a["A", 1]
[1] NA                   # expected
> a["B", 1]
[1] 3                    # expected
> a["C", 1]
[1] NA                   # expected

Let's do the same for data frame b:

> b["A10", 1]
[1] 4
> b["A1", 1]
[1] 4                    # unexpected, should be NA
> b["A", 1]
[1] 4                    # unexpected, should be NA
> b["B", 1]
[1] 5                    # expected
> b["C", 1]
[1] NA                   # expected

Given that a["A", 1] returns NA, why do b["A", 1] and b["A1", 1] return 4 instead of NA?

PS. R version 3.5.2


Solution

  • Synthesizing some of the comments here...


    ?`[` says:

    Unlike S (Becker et al p. 358), R never uses partial matching when extracting by [, and partial matching is not by default used by [[ (see argument exact).

    But ?`[.data.frame` says:

    Both [ and [[ extraction methods partially match row names. By default neither partially match column names, but [[ will if exact = FALSE (and with a warning if exact = NA). If you want to exact matching on row names use match, as in the examples.

    The example given there is:

    sw <- swiss[1:5, 1:4]
    sw["C", ]
    ##            Fertility Agriculture Examination Education
    ## Courtelary      80.2          17          15        12
    
    sw[match("C", row.names(sw)), ]
    ##    Fertility Agriculture Examination Education
    ## NA        NA          NA          NA        NA
    

    Meanwhile:

    as.matrix(sw)["C", ]
    ## Error in as.matrix(sw)["C", ] : subscript out of bounds
    

    So row names of matrices are matched exactly while row names of data frames are matched partially, and both behaviours are documented.
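
Applied to data frame b from the question, exact matching might look like the minimal sketch below. The helper name exact_rows is hypothetical (it is not part of base R); it only wraps match() in the way the help page suggests.

```r
# Data frame b from the question.
b <- as.data.frame(matrix(4:5, ncol = 1, nrow = 2,
                          dimnames = list(c("A10", "B"), "V1")))

# Hypothetical helper: look up rows by exact row name only.
# match() returns NA for names that are not present, so such rows come back as NA.
exact_rows <- function(df, names, cols = seq_along(df)) {
  df[match(names, row.names(df)), cols]
}

exact_rows(b, "A10", 1)  # exact row name, returns the stored value 4
exact_rows(b, "A1", 1)   # no such row name, NA instead of a partial match
exact_rows(b, "A", 1)    # likewise NA
```

Wrapping the lookup in a function is only a convenience; writing b[match("A1", row.names(b)), 1] inline does the same thing.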

    [.data.frame is implemented in R, not C, so you can inspect the source code by printing the function. The partial matching happens here:

        if (is.character(i)) {
            rows <- attr(xx, "row.names")
            i <- pmatch(i, rows, duplicates.ok = TRUE)
        }
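
This also accounts for the difference between a and b in the question. pmatch() prefers an exact match, accepts a partial match only if it is unique, and returns NA when a string is a prefix of more than one candidate. A small check, using just the row names from the question:

```r
rows_a <- c("A1", "A10", "B")  # row names of a
rows_b <- c("A10", "B")        # row names of b

pmatch("A",  rows_a, duplicates.ok = TRUE)  # NA: "A" is a prefix of both "A1" and "A10"
pmatch("A1", rows_a, duplicates.ok = TRUE)  # 1: the exact match "A1" takes precedence
pmatch("A",  rows_b, duplicates.ok = TRUE)  # 1: only "A10" starts with "A"
pmatch("A1", rows_b, duplicates.ok = TRUE)  # 1: only "A10" starts with "A1"
pmatch("C",  rows_b, duplicates.ok = TRUE)  # NA: no row name starts with "C"
```

So a["A", 1] is NA because the partial match is ambiguous, not because partial matching is switched off, while in b both "A" and "A1" resolve uniquely to the row "A10".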
    

    There happens to be a recent thread on Bugzilla about partial matching of row names of data frames. (No discussion yet...)

    It is definitely surprising that [.data.frame doesn't match the behaviour of [ with respect to character indices.