
Chaining lookup lists - is this the most efficient way?


I have two named lists that I'm using as lookup tables:

iState <- list("A" = "Alaska", "T" = "Texas", "G" = "Georgia")    
sCap <- list("Alaska" = "Juneau", "Texas" = "Austin", "Georgia" = "Atlanta")

And a vector of keys to look up:

foo <- c("T", "G", "A", "B", NA)

This code chains them together and gives me the lookup I want:

sCap[iState[foo] %>% as.character() %>% na_if("NULL")] %>% as.character() %>% na_if("NULL")
# [1] "Austin"  "Atlanta" "Juneau"  NA        NA      

Is this the fastest way, in terms of execution time, to chain these lookup lists together? Or is there a better way?


Solution

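  • You can do a lot better if you use named character vectors as lookup tables instead of lists. Basically, I changed list() to c() and cut out all the as.character() and na_if() calls, which are no longer needed because looking up a missing key in a named vector already returns NA.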

    vState <- c("A" = "Alaska", "T" = "Texas", "G" = "Georgia")    
    vCap <- c("Alaska" = "Juneau", "Texas" = "Austin", "Georgia" = "Atlanta")
    
    vCap[vState[foo]]
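
    As a quick check of what this returns, here is a minimal base-R sketch using the same objects. The intermediate name `states` is only there for illustration. Keys with no match (like "B") and NA keys both come back as NA, and the result carries names from the intermediate lookup, which unname() strips.

    ```r
    vState <- c("A" = "Alaska", "T" = "Texas", "G" = "Georgia")
    vCap <- c("Alaska" = "Juneau", "Texas" = "Austin", "Georgia" = "Atlanta")
    foo <- c("T", "G", "A", "B", NA)

    # Inner lookup: codes -> state names; unmatched or NA keys give NA
    states <- vState[foo]

    # Outer lookup: state names -> capitals; NA propagates through
    caps <- unname(vCap[states])
    caps
    # [1] "Austin"  "Atlanta" "Juneau"  NA        NA
    ```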
    

    Benchmarking methods so far:

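    library(dplyr)  # provides %>%, na_if(), and recode()
    microbenchmark::microbenchmark(
      # dplyr::recode() with the lists spliced in as named arguments
      recode = foo %>%
        dplyr::recode(!!!iState, .default = NA_character_) %>%
        dplyr::recode(!!!sCap, .default = NA_character_),
      # original list-based lookup, with pipes
      lists = sCap[iState[foo] %>% as.character() %>% na_if("NULL")] %>% as.character() %>% na_if("NULL"),
      # same list-based lookup written as nested calls
      lists_no_pipe = na_if(as.character(sCap[na_if(as.character(iState[foo]), "NULL")]), "NULL"),
      # named-vector lookup
      vectors = unname(vCap[vState[foo]])
    )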
    # Unit: microseconds
    #           expr   min     lq    mean median     uq   max neval
    #         recode 227.1 244.05 305.203 268.05 319.55 591.1   100
    #          lists 182.2 198.85 244.964 222.10 254.20 562.6   100
    #  lists_no_pipe  11.4  13.25  17.726  15.45  18.70  64.5   100
    #        vectors   2.5   3.85   5.269   4.90   6.40  12.9   100
    

    If you want things to be as fast as possible, don't use %>%, because it adds overhead. If you are doing complicated things, the extra microseconds from piping don't really matter. But in this case the operations are already so quick that the pipe overhead dominates: in the benchmark above, the same list lookup drops from a median of about 222 microseconds with pipes to about 15 without.
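
    If you like the piped style, one option not covered in the benchmark above is the native pipe |>. This is a sketch that assumes R 4.1 or later. The parser rewrites the native pipe into an ordinary nested call before anything is evaluated, so it should not add the per-call overhead that %>% does, though it has not been timed on this example.

    ```r
    # Requires R >= 4.1 for the native pipe
    vState <- c("A" = "Alaska", "T" = "Texas", "G" = "Georgia")
    vCap <- c("Alaska" = "Juneau", "Texas" = "Austin", "Georgia" = "Atlanta")
    foo <- c("T", "G", "A", "B", NA)

    # The parsed expression is already the nested call; no pipe function remains
    quote(foo |> toupper())
    # toupper(foo)

    # Piped spelling of the vector lookup; evaluates exactly like unname(vCap[vState[foo]])
    piped <- vCap[vState[foo]] |> unname()
    ```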

    You may be able to go even faster, especially if your lookup tables are large, by using a join to a keyed data.table instead.
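
    A rough sketch of what that could look like, assuming the data.table package is installed. The table and column names (dtState, dtCap, code, state, capital) are made up for illustration, and this has not been benchmarked against the vector version; for three-row tables the join setup may well cost more than it saves.

    ```r
    library(data.table)

    foo <- c("T", "G", "A", "B", NA)

    # Hypothetical lookup tables; key = sets the column used for keyed joins
    dtState <- data.table(code = c("A", "T", "G"),
                          state = c("Alaska", "Texas", "Georgia"),
                          key = "code")
    dtCap <- data.table(state = c("Alaska", "Texas", "Georgia"),
                        capital = c("Juneau", "Austin", "Atlanta"),
                        key = "state")

    # Keyed join: one result row per element of foo, NA where nothing matches
    states <- dtState[.(foo), state]
    caps <- dtCap[.(states), capital]
    caps
    ```

    The result should line up element by element with foo, the same as unname(vCap[vState[foo]]).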