Search code examples
rfor-loophashvectorizationlookup-tables

Lookup table with the query with arbitrary length without using a for loop in R


Say I have a lookup table as following

dt <- data.frame(name=c("jack","jill","sam","dan"),age=c(20,14,28,13))

  name age
1 jack  20
2 jill  14
3  sam  28
4  dan  13

Now I want to convert the following vectors to the vectors containing the ages of each element.

query1 <- c("jack","dan")
query2 <- c("sam")
query3 <- c("jack","sam", "dan")

I can build the following function(which I don't like) to accomplish this task,

get.age <- function(x) {

  answer <- list()
  for(i in 1:length(x)){
    answer[[i]]<- dt[dt$name==x[i],"age"]
  }
  ldply(answer)$V1
}

which gets the job done like this

> get.age(query1)
[1] 20 13
> get.age(query2)
[1] 28
> get.age(query3)
[1] 20 28 13

But I don't like the solution because it uses a for loop and some dirty hack. Ideally, I would want to do it more R-like using vector operations like this(which doesn't seem to work)

> dt[dt$name==c("jack","dan"),"age"]
[1] 20 13 #worked
> dt[dt$name==c("jack","sam"),"age"]
[1] 20  # not the right answer

The following solution works, but this requires the prior knowledge of how many things I lookup.

dt[dt$name=="jack" | dt$name=="sam","age"]

[1] 20 28

I would like to know a method that can handle arbitrary size of vectors that converts the keys to elements without using for loop, if there is any


Solution

  • For a true lookup table, the result should be the length of the query and also deal with replication in the query. The approaches using match(...) are the only ones that do this:

    query4 <- c("jack","sam", "dan","sam","jack")
    dt[match(query4,dt$name),]$age
    # [1] 20 28 13 28 20
    

    This is because match(LHS,RHS) returns an integer vector of length(LHS) which contains the row numbers of the RHS which match the corresponding element of LHS.

    The approaches based on comparison (==) will generally not work. This s because, when comparing two vectors, R tries to replicate the shorter one however many times are needed to make it the same length as the longer one, and then does an element-by-element comparison. So in the case of dt$name==query1, for example, the RHS gets replicated twice and the comparison is between c("jack","jill","sam","dan") and c("jack","dan","jack","dan").

    dt$name==query1   # RHS is: c("jack","dan","jack","dan")
    # [1]  TRUE FALSE FALSE  TRUE
    dt$name==query2   # RHS is: c("sam","sam","sam","sam")
    # [1] FALSE FALSE  TRUE FALSE
    dt$name==query3   # RHS is: c("jack","sam", "dan","jack") with warning
    # [1]  TRUE FALSE FALSE FALSE
    # with warning:   longer object length is not a multiple of shorter object length
    

    On the other hand, using LHS %in% RHS gives a result with length(LHS) and either T or F depending on whether that element is present in RHS.

    dt$name %in% query1
    # [1]  TRUE FALSE FALSE  TRUE
    query1 %in% dt$name
    # [1] TRUE TRUE
    

    Note that it looks like df$name %in% query1 and df$name==query1 give the same result, but that's an artifact of query1 being replicated twice in the latter comparison. See, for example:

    dt$name %in% query3
    # [1]  TRUE FALSE  TRUE  TRUE
    dt$name  ==  query3
    # [1]  TRUE FALSE FALSE FALSE