I run across this often enough that I figure there has to be a good idiom for it. Suppose I have a data.frame with a bunch of attributes, including "product." I also have a key which translates products to brand + size. Product codes 1-3 are Tylenol, 4-6 are Advil, 7-9 are Bayer, 10-12 are Generic.
What's the fastest (in terms of human time) way to code this up?
I tend to use nested ifelse
's if there are 3 or fewer categories, and type out the data table and merge it in if there are more than 3. Any better ideas? Stata has a recode
command that is pretty nifty for this sort of thing, although I believe it promotes data-code intermixing a little too much.
dat <- structure(list(product = c(11L, 11L, 9L, 9L, 6L, 1L, 11L, 5L,
7L, 11L, 5L, 11L, 4L, 3L, 10L, 7L, 10L, 5L, 9L, 8L)), .Names = "product", row.names = c(NA,
-20L), class = "data.frame")
One could use a list as an associative array to define the brand -> product code
mapping, i.e.:
brands <- list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
Once you have this, you can then either invert this to create a product code -> brand
list (could take a lot of memory), or just use a search function:
find.key <- function(x, li, default=NA) {
ret <- rep.int(default, length(x))
for (key in names(li)) {
ret[x %in% li[[key]]] <- key
}
return(ret)
}
I'm sure there are better ways of writing this function (the for
loop is annoying me!), but at least it is vectorised, so it only requires a single pass through the list.
Using it would be something like:
> dat$brand <- find.key(dat$product, brands)
> dat
product brand
1 11 Generic
2 11 Generic
3 9 Bayer
4 9 Bayer
5 6 Advil
6 1 Tylenol
7 11 Generic
8 5 Advil
9 7 Bayer
10 11 Generic
11 5 Advil
12 11 Generic
13 4 Advil
14 3 Tylenol
15 10 Generic
16 7 Bayer
17 10 Generic
18 5 Advil
19 9 Bayer
20 8 Bayer
The recode
and levels<-
solutions are very nice, but they are also significantly slower than this one (and once you have find.key
this is easier-for-humans than recode
and on par with the levels<-
):
> microbenchmark(
recode=recode(dat$product,recodes="1:3='Tylenol';4:6='Advil';7:9='Bayer';10:12='Generic'"),
find.key=find.key(dat$product, brands),
levels=`levels<-`(factor(dat$product),brands))
Unit: microseconds
expr min lq median uq max
1 find.key 64.325 69.9815 76.8950 83.8445 221.748
2 levels 240.535 248.1470 274.7565 306.8490 1477.707
3 recode 1636.039 1683.4275 1730.8170 1855.8320 3095.938
(I can't get the switch
version to benchmark properly, but it appears to be faster than all of the above, although it is even worse-for-humans than the recode
solution.)