
R: Same object saved in different file formats and then re-imported takes different memory usage


Problem:

I have a dataset in R of several GB. After some cleaning, I switched from CSV to arrow's parquet file format for faster reading/writing and a smaller footprint on my SSD.

When I loaded it back in a fresh R session, I found that the same dataset occupies a different amount of memory depending on the format it was read from, even though the object.size of the data.frame is the same.

Shouldn't the object loaded in memory have the same size regardless of the original format?

What's the explanation for this behavior?

Example code below:

library(dplyr)

df <- nycflights13::flights


data.table::fwrite(df, "df.csv")

arrow::write_parquet(df, "df.parquet")

#Restart R, then reload df (dplyr must be loaded again for the pipe).
#Do a garbage collection and check memory usage and object.size

library(dplyr)

df <- data.table::fread("df.csv") %>% as.data.frame()

gc()

object.size(df)

#Restart R and load df. Do a garbage collection and check memory usage and object.size

df <- arrow::read_parquet("df.parquet")

gc()

object.size(df)

The difference here is small but not zero. With the dataset I'm actually working on, the difference in allocated memory is much bigger: almost double.


Solution

  • This discrepancy seems to come from comparing apples to pears. data.table imports tables differently from read.csv and read_csv (from readr). First of all, it loads the data as a data.table, while read_parquet reads the data into a tibble. fread also checks for integer columns and stores these as integers, while read_parquet stores them as numeric.

    Looking into the objects, we can clearly see that the df loaded with fread has several integer columns, which are numeric in the parquet version. Let's try converting these and compare the object sizes:

    df <- data.table::fread("df.csv")
    df2 <- arrow::read_parquet("df.parquet")
    df3 <- df2 %>%
      mutate(across(names(which(sapply(df, is.integer))), as.integer))
    pryr::object_size(df)  # 32,567,576 B
    pryr::object_size(df2) # 2,700,824 B
    pryr::object_size(df3) # 32,570,040 B
    

    If we instead use object.size, these values change, and df2 actually turns out to use more memory than df. This indicates that pryr might not correctly account for the nuances of how a data.table is stored.

    Furthermore, data.table does additional work in the background because it utilizes threads, which might account for the remaining discrepancy.
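
    The two size effects discussed above, integer versus numeric storage and the way object.size counts shared objects, can be illustrated with base R alone. This is a minimal sketch using only toy vectors; no packages or real data are required:

    ```r
    # Integers use 4 bytes per element, doubles (numeric) use 8,
    # so converting integer columns to numeric roughly doubles their size.
    x_int <- 1:1e6
    x_dbl <- as.numeric(x_int)
    object.size(x_int)  # roughly 4 MB
    object.size(x_dbl)  # roughly 8 MB

    # utils::object.size counts a shared object once per reference,
    # whereas pryr::object_size (and lobstr::obj_size) count it once in total.
    y <- list(x_dbl, x_dbl)  # two references to the same ~8 MB vector
    object.size(y)           # roughly 16 MB: the vector is counted twice
    # pryr::object_size(y)   # roughly 8 MB: shared memory counted once
    ```

    This is why the two measurement tools can disagree on the same object, and why column types alone explain much of the fread/read_parquet gap.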