Arithmetic operations on R factors


I have an R dataframe and I'm trying to subtract one column from another. I extract the columns using the $ operator but the class of the columns is 'factor' and R won't perform arithmetic operations on factors. Are there special functions to do this?


Solution

  • If you really want to do arithmetic on the factor's underlying integer codes (which is what as.numeric() returns for a factor), you're either doing something very wrong or being too clever for your own good.

    If what you have is a factor containing numbers stored in the levels of the factor, then you want to coerce it to numeric first using as.numeric(as.character(...)):

    # stringsAsFactors=TRUE is needed in R >= 4.0.0, where the default is FALSE
    dat <- data.frame(f=as.character(runif(10)), stringsAsFactors=TRUE)
    

    You can see the difference between reading the factor's integer codes and reading its contents here:

    > as.numeric(dat$f)
     [1]  9  7  2  1  4  6  5  3 10  8
    > as.numeric(as.character(dat$f))
     [1] 0.6369432 0.4455214 0.1204000 0.0336245 0.2731787 0.4219241 0.2910194
     [8] 0.1868443 0.9443593 0.5784658
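
Applied to the original question, subtracting one factor column from another might look like the sketch below. The data frame `df` and its columns `a` and `b` are made up for illustration; the only assumption is that every level label parses as a number.

```r
# Hypothetical data frame: both columns hold numbers but are stored as factors.
df <- data.frame(
  a = c("10.5", "3", "7.25"),
  b = c("0.5", "1", "2.25"),
  stringsAsFactors = TRUE
)

# df$a - df$b would only warn that '-' is not meaningful for factors
# and return NA, so convert the labels to numbers first.
a_num <- as.numeric(as.character(df$a))
b_num <- as.numeric(as.character(df$b))

# Subtract the numeric vectors: 10.5 - 0.5, 3 - 1, 7.25 - 2.25.
diff_ab <- a_num - b_num
print(diff_ab)
```

Using as.numeric(df$a) - as.numeric(df$b) here would instead subtract the integer codes, which have nothing to do with the values in the labels.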
    

    Timings against an alternative approach, which converts only the levels and then indexes them by the factor, show that the alternative is faster when many elements share the same level:

    dat <- data.frame( f = sample(as.character(runif(10)),10^4,replace=TRUE), stringsAsFactors=TRUE )
    library(microbenchmark)
    microbenchmark(
      as.numeric(as.character(dat$f)),
      as.numeric( levels(dat$f) )[dat$f] ,
      as.numeric( levels(dat$f)[dat$f] ),
      times=50
      )
    
                                  expr     min      lq  median      uq     max
    1  as.numeric(as.character(dat$f)) 7835865 7869228 7919699 7998399 9576694
    2 as.numeric(levels(dat$f))[dat$f]  237814  242947  255778  270321  371263
    3 as.numeric(levels(dat$f)[dat$f]) 7817045 7905156 7964610 8121583 9297819
    

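    Therefore, if the factor has far fewer levels than elements (length(levels(dat$f)) < length(dat$f)), use as.numeric(levels(dat$f))[dat$f] for a substantial speed gain: in the timings above its median is about 30 times lower than that of the other two expressions.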
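
If the levels-first conversion is needed in several places, it could be wrapped in a small helper. The name `factor_to_numeric` is hypothetical and not part of base R; this is a minimal sketch that assumes every level label parses as a number.

```r
# Hypothetical helper (not part of base R): convert a factor whose labels are
# numbers into a numeric vector, converting each level only once.
factor_to_numeric <- function(f) {
  stopifnot(is.factor(f))
  # Indexing by a factor uses its integer codes, so each element picks up
  # the numeric value of its own level.
  as.numeric(levels(f))[f]
}

f <- factor(c("0.5", "2", "0.5", "10", "2"))
x <- factor_to_numeric(f)

# The result matches the element-wise conversion.
stopifnot(identical(x, as.numeric(as.character(f))))
```

The string-to-number conversion runs once per level rather than once per element, which is where the saving comes from when levels repeat.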

    If length(levels(dat$f)) is approximately equal to length(dat$f), there is no speed gain:

    dat <- data.frame( f = as.character(runif(10^4)), stringsAsFactors=TRUE )
    library(microbenchmark)
    microbenchmark(
      as.numeric(as.character(dat$f)),
      as.numeric( levels(dat$f) )[dat$f] ,
      as.numeric( levels(dat$f)[dat$f] ),
      times=50
      )
    
                                  expr     min      lq  median      uq      max
    1  as.numeric(as.character(dat$f)) 7986423 8036895 8101480 8202850 12522842
    2 as.numeric(levels(dat$f))[dat$f] 7815335 7866661 7949640 8102764 15809456
    3 as.numeric(levels(dat$f)[dat$f]) 7989845 8040316 8122012 8330312 10420161