Is there any good reason for columns to be characters instead of factors?

This may seem like a silly question, but after working with R for a couple of months, I realised I often find myself converting strings to factors as, for example, the tabulate function does not work on strings.

At this point I am contemplating simply always converting any string to a factor. But that raises the question: is there any reason not to (apart from carrying out operations on the string itself)?

Solution

Factors have a dual representation: the 'label' and the underlying integer encoding of the level. Which of these representations R uses in a given operation can be subtle and confusing.

One place where this causes confusion is subsetting. Here's a named vector, a character vector, and a factor with default (alphabetically ordered) levels:

x = c(foo = 1, bar = 2)
y = c("bar", "foo")
z = factor(y)        # default levels are "bar", "foo", i.e., alphabetical

Subsetting x by y matches each character value to a name, but subsetting x by z uses the underlying integer codes of the factor (here 1 and 2), so it selects by position rather than by name.

> x[y]
bar foo 
  2   1 
> x[z]
foo bar 
  1   2
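
To see why the two results differ, it can help to pull the two representations of the factor apart. This is only a minimal sketch; the names codes, by_code and by_label are made up for illustration.

```r
x <- c(foo = 1, bar = 2)
z <- factor(c("bar", "foo"))      # default levels: "bar", "foo"

codes <- as.integer(z)            # the underlying integer codes of z
by_code <- x[codes]               # positional subsetting, which is what x[z] does
by_label <- x[as.character(z)]    # label subsetting, which matches names like x[y]

identical(x[z], by_code)          # TRUE: the factor was treated as its codes
names(by_label)                   # "bar" "foo"
```

So if the intent is to look things up by name, converting with as.character() first sidesteps the surprise.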

This gets even more confusing because R can run in different locales (I am using en_US, US English), and the collation (sort) order differs between locales, so the default levels for the same data might differ from one locale to another.
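
One way to avoid depending on the locale's sort order is to state the levels explicitly when building the factor. The sketch below assumes you want the level order to follow the names of x; z_explicit is just an illustrative name.

```r
x <- c(foo = 1, bar = 2)
y <- c("bar", "foo")

# Levels are given by hand, so they do not depend on locale collation
z_explicit <- factor(y, levels = names(x))

levels(z_explicit)                # "foo" "bar"
as.integer(z_explicit)            # 2 1
identical(x[z_explicit], x[y])    # TRUE: codes now line up with positions in x
```

With the levels pinned like this, the integer codes happen to agree with the positions in x, but that is a property of this particular setup rather than something factors guarantee in general.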