Search code examples
rr-factorrecode

Idiom for ifelse-style recoding for multiple categories


I run across this often enough that I figure there has to be a good idiom for it. Suppose I have a data.frame with a bunch of attributes, including "product." I also have a key which translates products to brand + size. Product codes 1-3 are Tylenol, 4-6 are Advil, 7-9 are Bayer, 10-12 are Generic.

What's the fastest (in terms of human time) way to code this up?

I tend to use nested ifelse's if there are 3 or fewer categories, and type out the data table and merge it in if there are more than 3. Any better ideas? Stata has a recode command that is pretty nifty for this sort of thing, although I believe it promotes data-code intermixing a little too much.

dat <- structure(list(product = c(11L, 11L, 9L, 9L, 6L, 1L, 11L, 5L, 
7L, 11L, 5L, 11L, 4L, 3L, 10L, 7L, 10L, 5L, 9L, 8L)), .Names = "product", row.names = c(NA, 
-20L), class = "data.frame")

Solution

  • One could use a list as an associative array to define the brand -> product code mapping, i.e.:

    brands <- list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
    

    Once you have this, you can then either invert this to create a product code -> brand list (could take a lot of memory), or just use a search function:

    find.key <- function(x, li, default=NA) {
        ret <- rep.int(default, length(x))
        for (key in names(li)) {
            ret[x %in% li[[key]]] <- key
        }
        return(ret)
    }
    

    I'm sure there are better ways of writing this function (the for loop is annoying me!), but at least it is vectorised, so it only requires a single pass through the list.

    Using it would be something like:

    > dat$brand <- find.key(dat$product, brands)
    > dat
       product   brand
    1       11 Generic
    2       11 Generic
    3        9   Bayer
    4        9   Bayer
    5        6   Advil
    6        1 Tylenol
    7       11 Generic
    8        5   Advil
    9        7   Bayer
    10      11 Generic
    11       5   Advil
    12      11 Generic
    13       4   Advil
    14       3 Tylenol
    15      10 Generic
    16       7   Bayer
    17      10 Generic
    18       5   Advil
    19       9   Bayer
    20       8   Bayer
    

    The recode and levels<- solutions are very nice, but they are also significantly slower than this one (and once you have find.key this is easier-for-humans than recode and on par with the levels<-):

    > microbenchmark(
         recode=recode(dat$product,recodes="1:3='Tylenol';4:6='Advil';7:9='Bayer';10:12='Generic'"), 
         find.key=find.key(dat$product, brands),
         levels=`levels<-`(factor(dat$product),brands))
    Unit: microseconds
          expr      min        lq    median        uq      max
    1 find.key   64.325   69.9815   76.8950   83.8445  221.748
    2   levels  240.535  248.1470  274.7565  306.8490 1477.707
    3   recode 1636.039 1683.4275 1730.8170 1855.8320 3095.938
    

    (I can't get the switch version to benchmark properly, but it appears to be faster than all of the above, although it is even worse-for-humans than the recode solution.)