Say I have a lookup table as following
dt <- data.frame(name=c("jack","jill","sam","dan"),age=c(20,14,28,13))
name age
1 jack 20
2 jill 14
3 sam 28
4 dan 13
Now I want to convert the following vectors to the vectors containing the ages of each element.
query1 <- c("jack","dan")
query2 <- c("sam")
query3 <- c("jack","sam", "dan")
I can build the following function(which I don't like) to accomplish this task,
get.age <- function(x) {
answer <- list()
for(i in 1:length(x)){
answer[[i]]<- dt[dt$name==x[i],"age"]
}
ldply(answer)$V1
}
which gets the job done like this
> get.age(query1)
[1] 20 13
> get.age(query2)
[1] 28
> get.age(query3)
[1] 20 28 13
But I don't like the solution because it uses a for loop and some dirty hack. Ideally, I would want to do it more R-like using vector operations like this(which doesn't seem to work)
> dt[dt$name==c("jack","dan"),"age"]
[1] 20 13 #worked
> dt[dt$name==c("jack","sam"),"age"]
[1] 20 # not the right answer
The following solution works, but this requires the prior knowledge of how many things I lookup.
dt[dt$name=="jack" | dt$name=="sam","age"]
[1] 20 28
I would like to know a method that can handle arbitrary size of vectors that converts the keys to elements without using for loop, if there is any
For a true lookup table, the result should be the length of the query and also deal with replication in the query. The approaches using match(...)
are the only ones that do this:
query4 <- c("jack","sam", "dan","sam","jack")
dt[match(query4,dt$name),]$age
# [1] 20 28 13 28 20
This is because match(LHS,RHS)
returns an integer vector of length(LHS) which contains the row numbers of the RHS which match the corresponding element of LHS.
The approaches based on comparison (==) will generally not work. This s because, when comparing two vectors, R tries to replicate the shorter one however many times are needed to make it the same length as the longer one, and then does an element-by-element comparison. So in the case of dt$name==query1
, for example, the RHS gets replicated twice and the comparison is between c("jack","jill","sam","dan")
and c("jack","dan","jack","dan")
.
dt$name==query1 # RHS is: c("jack","dan","jack","dan")
# [1] TRUE FALSE FALSE TRUE
dt$name==query2 # RHS is: c("sam","sam","sam","sam")
# [1] FALSE FALSE TRUE FALSE
dt$name==query3 # RHS is: c("jack","sam", "dan","jack") with warning
# [1] TRUE FALSE FALSE FALSE
# with warning: longer object length is not a multiple of shorter object length
On the other hand, using LHS %in% RHS
gives a result with length(LHS) and either T or F depending on whether that element is present in RHS.
dt$name %in% query1
# [1] TRUE FALSE FALSE TRUE
query1 %in% dt$name
# [1] TRUE TRUE
Note that it looks like df$name %in% query1
and df$name==query1
give the same result, but that's an artifact of query1
being replicated twice in the latter comparison. See, for example:
dt$name %in% query3
# [1] TRUE FALSE TRUE TRUE
dt$name == query3
# [1] TRUE FALSE FALSE FALSE