
Why does order() in R generate NAs when passing in a subsetted dataframe?


I'm having a little trouble understanding what is going on here. It appears to me that both methods for ordering the data frame below are equivalent.

Our data frame:

df <- data.frame(chr = 1:5,
                 id = c("ENSG1", "ENSG2", "ENSG3", "ENSG4", "ENSG5"),
                 value = runif(5, 5.0, 10.0))
df <- df[sample(nrow(df)),]
df

chr    id    value
5      ENSG5 8.913645
2      ENSG2 6.117744
4      ENSG4 8.558403
3      ENSG3 9.625546
1      ENSG1 6.105577

Now, method 1:

df[order(df[,c("chr","id")]),]

chr    id    value
1      ENSG1 6.105577
2      ENSG2 6.117744
3      ENSG3 9.625546
4      ENSG4 8.558403
5      ENSG5 8.913645
NA    <NA>       NA
NA    <NA>       NA
NA    <NA>       NA
NA    <NA>       NA
NA    <NA>       NA

This throws in NA rows for some curious reason, while passing the df columns to order() individually, as in

method 2:

df[order(df$chr,df$id),]

chr    id    value
1      ENSG1 6.105577
2      ENSG2 6.117744
3      ENSG3 9.625546
4      ENSG4 8.558403
5      ENSG5 8.913645

does not.

Can someone explain why method 1 and method 2 are not interchangeable?


Solution

  • When we look at ?order, its first arguments are documented as:

    a sequence of numeric, complex, character or logical vectors, all of the same length, or a classed R object.

    Nothing there really suggests that it would work on a data frame. A "classed R object" is a bit vague, and suggests that a data frame won't throw an error, but it certainly doesn't say "or a data frame".

    The Description says:

    See the examples for how to use these functions to sort data frames, etc.

    When you call order on a data frame, you can see what happens:

    order(data.frame(a = 1:5, b = 5:1))
    # [1]  1 10  2  9  3  8  4  7  5  6
    

    It looks like it coerces the data frame to a vector, and orders it. Not generally very useful. This is why when you run df[order(df[,c("chr","id")]),] you get the NA rows. Your input data frame had 2 columns, so the order() output had twice as many elements as the data frame has rows. The indices beyond nrow(df) point at rows that don't exist, and indexing a data frame with those returns rows of NA.

    You have already found the correct way to order a data frame, which is to give actual vectors to order. The vectors can be individual columns of your data frame or they can be other vectors of the correct length.
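
To make both points concrete, here is a minimal sketch. The data frame below is made up for illustration and only mirrors the layout in the question. It shows that a row index past nrow(df) yields an NA row, and that do.call() can hand each selected column to order() as its own argument, since a data frame is a list of columns. The unname() call is a precaution I've added so that column names can never be mistaken for named arguments of order().

```r
# Hypothetical data with the same layout as the question's data frame
df <- data.frame(
  chr   = c(5L, 2L, 4L, 3L, 1L),
  id    = c("ENSG5", "ENSG2", "ENSG4", "ENSG3", "ENSG1"),
  value = c(8.9, 6.1, 8.6, 9.6, 6.1),
  stringsAsFactors = FALSE
)

# Row 6 does not exist, so the second row of the result is all NA
out_of_range <- df[c(1, 6), ]

# Pass the selected columns to order() as separate vectors
key_cols <- unname(as.list(df[c("chr", "id")]))
idx <- do.call(order, key_cols)
sorted <- df[idx, ]

# idx matches what the explicit call order(df$chr, df$id) returns
same_as_explicit <- identical(idx, order(df$chr, df$id))
```

The do.call() form is mainly handy when the sort columns are held in a character vector; for a fixed pair of columns, order(df$chr, df$id) is just as good and easier to read.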