Efficiency of factor vs. characters - object size


I came across something odd. I always thought that storing data as a factor, where possible and meaningful, would give better storage efficiency.

But when I look at this:

object.size(c("A", "B", "B", "0", "A", "AB", "0")) # 336 bytes
gr <- factor(c("A", "B", "B", "0", "A", "AB", "0"))
object.size(gr) # 720 bytes

So the factor requires more storage than the character vector. Was what I read about storage efficiency all wrong?

Is there an example that makes the advantage of using factors visible to beginners?


Solution

  • Roughly speaking, a factor is an integer vector with a levels attribute (a character vector) listing the category names and a class attribute (another character vector) telling R that it's a factor.

    A short factor tends to require more memory than a character vector of the same length, because the cost of storing the factor's attributes more than offsets the saving due to storing integers instead of strings. Here is an extreme example illustrating this point:

    x <- c("a", "b")
    f <- factor(x)
    
    class(f)
    # [1] "factor"
    
    unclass(f)
    # [1] 1 2
    # attr(,"levels")
    # [1] "a" "b"
    

    Storing f requires storing both the integer vector c(1L, 2L) and the character vector c("a", "b"). In this case, the integer vector is completely redundant, because c("a", "b") encodes all of the information we needed in the first place.

    object.size(f)
    # 568 bytes
    object.size(x)
    # 176 bytes
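
    To see where the extra bytes go, the pieces of the factor can be measured separately. This is only a rough sketch: `codes` is a name introduced here for illustration, the parts do not add up exactly to `object.size(f)` because the attribute list itself also takes space, and the byte counts vary with R version and platform, so no outputs are shown.

    ```r
    x <- c("a", "b")
    f <- factor(x)

    # Integer codes with all attributes stripped
    codes <- as.integer(f)

    object.size(codes)      # the integer codes alone
    object.size(levels(f))  # the levels attribute (a character vector)
    object.size(class(f))   # the class attribute (another character vector)
    object.size(f)          # the whole factor
    ```

    The levels attribute is a character vector holding the same strings as `x`, so for a vector without repetitions the factor cannot come out smaller.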
    

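    Factors become more memory-efficient when each level is repeated many times. The integer codes take 4 bytes per element, whereas a character vector stores a pointer per element (8 bytes on a 64-bit build), so the fixed cost of the attributes is eventually outweighed.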

    g <- gl(2L, 1e06L, labels = c("a", "b"))
    y <- as.character(g)
    
    object.size(g)
    # 8000560 bytes
    object.size(y)
    # 16000160 bytes
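
    For beginners, it may help to watch the balance shift as the vector grows. The sketch below assumes a 64-bit build of R; `compare_sizes` is a helper defined here for illustration, not a base R function, and the exact byte counts depend on R version and platform.

    ```r
    # Size in bytes of a character vector and of the equivalent factor,
    # for n elements drawn from two distinct values.
    compare_sizes <- function(n) {
      x <- rep(c("a", "b"), length.out = n)
      c(n = n,
        character = as.numeric(object.size(x)),
        factor = as.numeric(object.size(factor(x))))
    }

    # One row per vector length, one column per representation
    sizes <- t(sapply(c(2, 10, 100, 1000, 1e5), compare_sizes))
    sizes
    ```

    In the first rows the factor should be the larger object, and in the last rows the character vector should be.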
    

    Some things to keep in mind:

    • Many R functions that handle categorical variables (table, split, etc.) convert character vector arguments to factors before doing anything else with them. Thus, actually doing stuff with a categorical variable almost always involves committing a factor to memory anyway.
    • Factors clearly communicate to users that the variable is categorical and not merely a sequence of strings.

    So, there are many good reasons to prefer factors, even if they are short.
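
    The point about `table` and `split` can be checked with the vector from the question. This is a minimal sketch; it only compares the results and does not measure the memory used by the internal conversion.

    ```r
    x <- c("A", "B", "B", "0", "A", "AB", "0")
    f <- factor(x)

    # The categories table() reports for the character vector are the
    # levels of the factor, and the counts match.
    same_names <- identical(names(table(x)), levels(f))
    same_counts <- all(as.vector(table(x)) == as.vector(table(f)))

    # split() groups the positions identically for both inputs.
    same_split <- identical(split(seq_along(x), x), split(seq_along(x), f))

    c(same_names, same_counts, same_split)
    ```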