I'm having a little trouble understanding what is going on here. It appears to me that both methods for ordering the data frame below should be equivalent.
Our data frame:
cols <- c("chr","id","value")
df <- data.frame(c(1:5),c("ENSG1","ENSG2","ENSG3","ENSG4","ENSG5"),runif(5,5.0,10.0))
names(df) <- cols
df <- df[sample(nrow(df)),]
df
chr id value
5 ENSG5 8.913645
2 ENSG2 6.117744
4 ENSG4 8.558403
3 ENSG3 9.625546
1 ENSG1 6.105577
Now, method 1:
df[order(df[,c("chr","id")]),]
chr id value
1 ENSG1 6.105577
2 ENSG2 6.117744
3 ENSG3 9.625546
4 ENSG4 8.558403
5 ENSG5 8.913645
NA <NA> NA
NA <NA> NA
NA <NA> NA
NA <NA> NA
NA <NA> NA
This throws in NA rows for some curious reason, while passing the df columns to order() individually, as in method 2:
df[order(df$chr,df$id),]
chr id value
1 ENSG1 6.105577
2 ENSG2 6.117744
3 ENSG3 9.625546
4 ENSG4 8.558403
5 ENSG5 8.913645
does not.
Can someone explain why method 1 and method 2 are not interchangeable?
When we look at ?order, its first arguments are documented as:
a sequence of numeric, complex, character or logical vectors, all of the same length, or a classed R object.
Nothing there really suggests that it would work on a data frame. A "classed R object" is a bit vague, and suggests that a data frame won't throw an error, but it certainly doesn't say "or a data frame".
The Description says:
See the examples for how to use these functions to sort data frames, etc.
When you call order() on a data frame, you can see what happens:
order(data.frame(a = 1:5, b = 5:1))
# [1] 1 10 2 9 3 8 4 7 5 6
It looks like it coerces the data frame to a single vector and orders that. Not generally very useful. This is why you get the NA rows when you run df[order(df[,c("chr","id")]),]. Your input data frame had 2 columns, so the order() output had twice as many elements as the data frame has rows, and every index past the last row comes back as a row of NAs.
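To make the mechanism concrete, here is a small sketch on a toy data frame (`d`, `idx` and `out` are names made up for illustration). It orders the two columns laid end to end, which is an assumption about what the coercion amounts to, but it does reproduce the permutation shown above:

```r
# Toy data frame: 5 rows, 2 columns.
d <- data.frame(a = 1:5, b = 5:1)

# Ordering the columns laid end to end gives 10 positions, not 5.
idx <- order(c(d$a, d$b))
idx
# [1]  1 10  2  9  3  8  4  7  5  6

# Positions 6 to 10 do not exist as rows, so each one comes back as an NA row.
out <- d[idx, ]
nrow(out)           # 10
sum(is.na(out$a))   # 5
```

Half of the indices point past the end of the data frame, which is exactly where the extra NA rows come from.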
You have already found the correct way to order a data frame, which is to give actual vectors to order(). The vectors can be individual columns of your data frame or they can be other vectors of the correct length.
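As a rough sketch of what that can look like in practice (the `set.seed()` call and the names `sort_cols`, `by_cols`, `by_names` and `by_value` are my own additions, not part of your code): if you want to keep the column names in a character vector, as method 1 was trying to do, one option is to hand the columns to order() as separate arguments with do.call().

```r
set.seed(1)  # only so the shuffled example is reproducible
df <- data.frame(chr = 1:5,
                 id = paste0("ENSG", 1:5),
                 value = runif(5, 5.0, 10.0))
df <- df[sample(nrow(df)), ]

# Columns passed one by one, as in method 2.
by_cols <- df[order(df$chr, df$id), ]

# The same call built from a vector of column names (sort_cols is a hypothetical name).
# unname() stops a column name from being mistaken for an argument of order().
sort_cols <- c("chr", "id")
by_names <- df[do.call(order, unname(as.list(df[sort_cols]))), ]

# Any vector with one element per row works, e.g. descending by value.
by_value <- df[order(-df$value), ]
```

Here by_cols and by_names should be the same data frame, since do.call() just spreads the list of columns out into order(df$chr, df$id).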