If you have a .csv file in which most values of most variables are repeated, the file on disk will not be small, because the format applies no compression. However, if the .csv file is read into R and the appropriate variables are coerced into factors, is there a compression benefit of some kind for the data frame or tibble? The repetition of factor values throughout a data frame or tibble seems like a great opportunity to compress, but I don't know if this actually happens.
I tried searching for this online, but I didn't find answers. I'm not sure where to look for the way factors are implemented.
The documentation you are looking for is on the ?factor help page:
factor returns an object of class "factor" which has a set of integer codes the length of x with a "levels" attribute of mode character and unique (!anyDuplicated(.)) entries.
So a factor is really just an integer vector along with a mapping (stored as an attribute) between each integer code and its label/level. Nicely space efficient if you have repeats!
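As a quick illustration of that structure, here is a minimal sketch. The vector x and its values are made up for the example; only base R functions are used.

```r
# Hypothetical example data with repeated values
x <- c("low", "high", "low", "medium", "high")
f <- factor(x)

typeof(f)      # "integer": the underlying storage is an integer vector
levels(f)      # "high" "low" "medium": the unique labels, sorted
as.integer(f)  # 2 1 2 3 1: each element is an index into levels(f)
unclass(f)     # the integer codes with the "levels" attribute attached
```

Each element costs one integer code, and each distinct label is stored only once in the levels attribute.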
However, later we see:
Note
In earlier versions of R, storing character data as a factor was more space efficient if there is even a small proportion of repeats. However, identical character strings now share storage, so the difference is small in most cases. (Integer values are stored in 4 bytes whereas each reference to a character string needs a pointer of 4 or 8 bytes.)
So, in older versions of R, factors could be much more space efficient, but newer versions have optimized character vector storage, so the difference is no longer as big.
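We can measure the current difference. On a 64-bit build, the factor takes about half the space of the character vector, since each element is a 4-byte integer code rather than an 8-byte pointer: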
n <- 1e6
# Character vector with only 26 distinct values, so repeats dominate
char <- sample(letters, size = n, replace = TRUE)
fact <- factor(char)
object.size(char)
# 8001504 bytes
object.size(fact)
# 4002096 bytes
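How much a factor saves depends on how many repeats there are. The following is a rough sketch, assuming a 64-bit build of R; the vectors repeated and distinct are hypothetical test data, not something from the question.

```r
n <- 1e5

# Hypothetical test data: few distinct values vs. all-distinct values
repeated <- sample(letters, size = n, replace = TRUE)
distinct <- as.character(seq_len(n))

# Sizes in bytes, converted to plain numbers for easy comparison
sizes <- c(
  repeated_chr = as.numeric(object.size(repeated)),
  repeated_fct = as.numeric(object.size(factor(repeated))),
  distinct_chr = as.numeric(object.size(distinct)),
  distinct_fct = as.numeric(object.size(factor(distinct)))
)
print(sizes)
```

With few distinct values the factor should come out smaller. With all-distinct values it comes out larger, because it stores both the integer codes and a full character vector of levels.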