Problem:
I have a dataset of several GB in R. After some cleaning, I switched from CSV to arrow's Parquet file format for faster reading/writing and a smaller footprint on my SSD.
When I load the dataset back in a fresh R session, it occupies a different amount of memory depending on the format it was read from, even though object.size of the data.frame is the same.
Shouldn't the object loaded in memory have the same size regardless of the on-disk format?
What explains this behavior?
Example code below:
library(dplyr)
df <- nycflights13::flights
data.table::fwrite(df, "df.csv")
arrow::write_parquet(df, "df.parquet")
#Restart R and load df. Do a garbage collection and check memory usage and object.size
df <- data.table::fread("df.csv") %>% as.data.frame()
gc()
object.size(df)
#Restart R and load df. Do a garbage collection and check memory usage and object.size
df <- arrow::read_parquet("df.parquet")
gc()
object.size(df)
The difference here is small but not zero. With the dataset I am actually working on, the difference in allocated memory is much bigger, almost double.
Answer:
This discrepancy seems to come from comparing apples to pears. data.table imports tables differently than read.csv and readr's read_csv. First of all, fread loads the data as a data.table, whereas read_parquet reads it into a tibble. fread also checks for integer columns and stores these as integers, while read_parquet stores them as numeric.
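To make the type difference concrete, here is a quick check (a sketch; it assumes the df.csv and df.parquet files written in the question's code):

```r
library(arrow)
library(data.table)

df_csv     <- data.table::fread("df.csv")
df_parquet <- arrow::read_parquet("df.parquet")

# Side-by-side storage type of each column; class(x)[1] keeps only the
# first class for multi-class columns such as POSIXct
types <- data.frame(
  column  = names(df_csv),
  csv     = sapply(df_csv, function(x) class(x)[1]),
  parquet = sapply(df_parquet, function(x) class(x)[1])
)
types[types$csv != types$parquet, ]
```

Filtering to the rows where the two columns disagree shows exactly which columns changed storage type between the two readers.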
Looking into the objects, we clearly see that the df loaded with fread has several integer columns which are numeric in the alternative. Let's try converting these and compare the object sizes:
df <- data.table::fread("df.csv")
df2 <- arrow::read_parquet("df.parquet")
# Convert the columns that fread stored as integer back to integer in df2
int_cols <- names(df)[sapply(df, is.integer)]
df3 <- df2 %>%
  mutate(across(all_of(int_cols), as.integer))
pryr::object_size(df) # 32,567,576 B
pryr::object_size(df2) # 2,700,824 B
pryr::object_size(df3) # 32,570,040 B
If we instead use object.size, these values change and df2 actually turns out to use more storage than df. This indicates that pryr might not correctly account for the nuances of how data.table stores its data.
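For comparison, the base-R numbers can be obtained directly (a sketch, reusing df and df2 as loaded above; exact values vary with R and package versions, so no figures are quoted here):

```r
# utils::object.size and pryr::object_size make different assumptions
# about shared components (e.g. shared character strings), so the two
# measurements can legitimately disagree
object.size(df)   # fread result (a data.table)
object.size(df2)  # read_parquet result (a tibble)
```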
Furthermore, data.table does additional work in the background because it utilizes threads, which might account for the remaining discrepancy.
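The threading can be inspected and, if needed, disabled, which helps isolate its effect when measuring memory (a minimal sketch):

```r
library(data.table)

getDTthreads()   # number of threads data.table currently uses
setDTthreads(1)  # force single-threaded operation
df1 <- fread("df.csv")
setDTthreads(0)  # 0 restores the default (use all available threads)
```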