
In R, do factors somehow save space?


If you have a .csv file in which most values of most variables are repeated, the file will still not be small, because CSV applies no compression. However, if the .csv file is read into R and the appropriate variables are coerced to factors, is there a compression benefit of some kind for the resulting data frame or tibble? The repetition of factor values throughout a data frame or tibble seems like a great opportunity to compress, but I don't know whether this actually happens.

I tried searching for this online, but I didn't find answers. I'm not sure where to look for the way factors are implemented.


Solution

  • The documentation you are looking for is at the ?factor help page:

    factor returns an object of class "factor" which has a set of integer codes the length of x with a "levels" attribute of mode character and unique (!anyDuplicated(.)) entries.

    So a factor is really just an integer vector along with a mapping (stored as an attribute) between each integer code and its label/level. Nicely space efficient if you have repeats!

    However, later we see:

    Note

    In earlier versions of R, storing character data as a factor was more space efficient if there is even a small proportion of repeats. However, identical character strings now share storage, so the difference is small in most cases. (Integer values are stored in 4 bytes whereas each reference to a character string needs a pointer of 4 or 8 bytes.)

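    So, in older versions of R, factors could be much more space efficient. Newer versions share storage for identical strings, so each element of a character vector costs only a pointer (4 or 8 bytes) compared with 4 bytes for an integer code, and the difference is now at most about a factor of two.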

    We can see the current difference:

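    n <- 1e6
    # Character vector with heavy repetition: only 26 distinct values
    char <- sample(letters, size = n, replace = TRUE)
    fact <- factor(char)
    
    object.size(char)  # one 8-byte pointer per element on 64-bit R
    # 8001504 bytes
    object.size(fact)  # one 4-byte integer code per element
    # 4002096 bytes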
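
To make the "integer codes plus a levels attribute" description concrete, here is a minimal sketch that inspects a small factor. The vectors `x` and `ids` are made-up illustration data, not from the question. The last lines also show the flip side, under the assumption that every string is unique: the factor then has to store both the codes and every label, so it comes out larger than the plain character vector.

```r
# Hypothetical example data: three distinct labels, some repeated
x <- c("low", "high", "mid", "high", "low")
f <- factor(x)

typeof(f)       # "integer": the underlying storage is an integer vector
as.integer(f)   # the codes, one per element, indexing into levels(f)
levels(f)       # the unique labels, stored once as an attribute

# Counter-case (assumption: all values unique, e.g. an ID column).
# The factor keeps n integer codes AND n distinct labels,
# so it should be larger than the character vector itself.
ids <- as.character(seq_len(1e5))
as.numeric(object.size(factor(ids))) > as.numeric(object.size(ids))
```

So coercing to a factor mainly pays off for columns with few distinct values; for near-unique columns such as IDs it is likely to cost memory rather than save it.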