r, string, type-conversion, factors

Is there any good reason for columns to be characters instead of factors?


This may seem like a silly question, but after working with R for a couple of months, I have realised that I often find myself converting strings to factors because, for example, the tabulate function does not work on strings.

At this point I am contemplating simply always converting any string to a factor. But that raises the question: is there any reason not to (apart from needing to carry out operations on the string itself)?


Solution

  • Factors have a dual representation: the 'label' and the underlying integer encoding of the level. Which of these representations R uses in a given context can be subtle and confusing.

    One place where this causes confusion is subsetting. Here are a named vector, a character vector, and a factor with default (alphabetically ordered) levels:

    x = c(foo = 1, bar = 2)
    y = c("bar", "foo")
    z = factor(y)        # default levels are "bar", "foo", i.e., alphabetical
    

    Subsetting x by y matches each character value to a name, but subsetting x by z uses the underlying integer codes of the levels (here 1 and 2), so it selects by position rather than by name.

    > x[y]
    bar foo 
      2   1 
    > x[z]
    foo bar 
      1   2 
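
    To make the two representations explicit, here is a minimal sketch that pulls them apart with as.integer() and as.character(). The names codes, labels, by_code and by_label are only illustrative, and it assumes a locale in which "bar" sorts before "foo".

    ```r
    x = c(foo = 1, bar = 2)
    y = c("bar", "foo")
    z = factor(y)                  # levels "bar", "foo"

    codes  = as.integer(z)         # underlying encoding: 1 2
    labels = as.character(z)       # labels: "bar" "foo"

    by_code  = x[z]                # behaves like x[codes]: positional subsetting
    by_label = x[as.character(z)]  # behaves like x[y]: subsetting by name
    ```

    Converting the factor back with as.character() before subsetting is one way to get the name-based result regardless of how the levels happen to be ordered.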
    

    This becomes even more confusing because R can run in different locales (e.g., I am using the en_US locale, US English), and the collation (sort) order differs between locales, so the default levels for the same data might differ from one locale to another.
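
    One possible way to sidestep the locale dependence, sketched here as an illustration rather than a recommendation, is to pass the levels explicitly. The vector w and the names c_levels, f and g are made up for this example. sort() with method = "radix" sorts character vectors in C-locale order, so the result should not depend on the session locale.

    ```r
    w = c("B", "a")

    # Default levels come from sorting the unique values in the current
    # locale, so their order can differ between machines.
    default_levels = levels(factor(w))

    # Radix sort uses C-locale collation: uppercase before lowercase.
    c_levels = sort(unique(w), method = "radix")
    f = factor(w, levels = c_levels)

    # Levels can also simply be written out by hand.
    g = factor(w, levels = c("a", "B"))
    ```

    With explicit levels, the integer codes of f and g are fixed by the code itself rather than by the collation order of whichever locale the session runs in.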