I came across something odd. I always thought that storing data as a factor variable, where possible and meaningful, results in better storage efficiency.
But when I look at this:
object.size(c("A", "B", "B", "0", "A", "AB", "0")) # 336 bytes
gr <- factor(c("A", "B", "B", "0", "A", "AB", "0"))
object.size(gr) # 720 bytes
So factor variables require more storage than character vectors. Was what I read about storage efficiency all wrong?
And is there an example that makes the advantage of using factors visible to beginners?
Roughly speaking, a factor is an integer vector with a levels attribute (a character vector) listing the category names and a class attribute (another character vector) telling R that it's a factor.
A short factor tends to require more memory than a character vector of the same length, because the cost of storing the factor's attributes more than offsets the saving due to storing integers instead of strings. Here is an extreme example illustrating this point:
x <- c("a", "b")
f <- factor(x)
class(f)
# [1] "factor"
unclass(f)
# [1] 1 2
# attr(,"levels")
# [1] "a" "b"
Storing f requires storing both the integer vector c(1L, 2L) and the character vector c("a", "b"). In this case, the integer vector is completely redundant, because c("a", "b") encodes all of the information we needed in the first place.
object.size(f)
# 568 bytes
object.size(x)
# 176 bytes
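To make the "integer codes plus levels" picture concrete, here is a small sketch using the vector from the question. The names codes, labs and rebuilt are only illustrative; the point is that indexing the levels by the integer codes recovers the original strings.

```r
f <- factor(c("A", "B", "B", "0", "A", "AB", "0"))

codes <- as.integer(f)   # the underlying integer vector, one code per element
labs  <- levels(f)       # the distinct category names, each stored only once

# Looking up each code in the levels gives back the character data
rebuilt <- labs[codes]
identical(rebuilt, as.character(f))
```

Each distinct string appears once in labs, however many times it occurs in the data, which is where the saving for long vectors comes from.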
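Factors become the more efficient choice when each level is repeated many times: every element then costs a 4-byte integer code instead of an 8-byte pointer to a string (on a 64-bit build of R), while the cost of the attributes stays fixed.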
g <- gl(2L, 1e06L, labels = c("a", "b"))
y <- as.character(g)
object.size(g)
# 8000560 bytes
object.size(y)
# 16000160 bytes
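For beginners, it may help to watch the balance shift as the vector grows. The following is a rough sketch, assuming a 64-bit build of R; the lengths in the first argument are arbitrary choices for illustration, and the exact byte counts can differ between R versions and platforms.

```r
# Compare the footprint of a two-level variable stored both ways
# for a few vector lengths n.
sizes <- sapply(c(2, 10, 100, 1000, 1e5), function(n) {
  x <- rep_len(c("a", "b"), n)   # character version
  f <- factor(x)                 # factor version of the same data
  c(n = n,
    character = as.numeric(object.size(x)),
    factor    = as.numeric(object.size(f)))
})
sizes <- as.data.frame(t(sizes))

# ratio > 1: the factor is larger; ratio < 1: the factor is smaller
sizes$ratio <- sizes$factor / sizes$character
print(sizes)
```

For the shortest vector the ratio should be above 1, and for the longest it should approach one half, in line with the two examples above.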
Some things to keep in mind:
Many functions (table, split, etc.) convert character vector arguments to factors before doing anything else with them. Thus, actually working with a categorical variable almost always involves committing a factor to memory anyway.
So, there are many good reasons to prefer factors, even if they are short.
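As a quick check of the point about table and split, this sketch compares results for a character vector and for the equivalent factor. The vector v is a made-up set of values to group and is not part of the original example.

```r
x <- c("A", "B", "B", "0", "A", "AB", "0")   # character vector from the question
v <- seq_along(x)                             # hypothetical values to group by x

# split() returns the same groups for the character vector and for the factor
same_split <- identical(split(v, x), split(v, factor(x)))

# table() reports its categories in the same order as the factor's levels
tab <- table(x)
same_levels <- identical(names(tab), levels(factor(x)))

c(same_split = same_split, same_levels = same_levels)
```

Both comparisons should come out TRUE, which is consistent with these functions building a factor from the character input internally.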