I have multiple huge CSV files that I have to export to Apache Parquet format and split into smaller files based on multiple criteria/keys (= column values). As I understand it, arrow is the R package that allows working with Apache Parquet files.
I work in a shared lab environment, and given the limited RAM (compared to the number of users working simultaneously in that environment), we are advised to create our data frames in local SQLite databases rather than importing them in-memory (into RAM).
The following pseudo-code shows how I import a CSV file into my local SQLite database. In the code below I use the sqldf and tidyverse packages.
library(sqldf)      # read.csv.sql(), sqldf()
library(tidyverse)  # str_c()

input_file_path <- "D:/tmp/mydata.csv"
db_file_path <- "D:/tmp/db_tmp_sqlite.db"

unlink(db_file_path)
sqldf(str_c("attach '", db_file_path, "' as new"))

sqldf(read.csv.sql(
  file = input_file_path,
  sql = "
    create table mytable as
    select
      . . .
    from
      file
  ",
  field.types = list(
    . . .
  ),
  header = TRUE,
  sep = ",",
  eol = "\n",
  dbname = db_file_path,
  drv = "SQLite"
))
This works well as expected: my table is created and I can run all the required SQL queries, in particular adding supplementary variables (columns) that will later be used as keys for exporting my tables to Apache Parquet format. However, according to the Apache Arrow for R cheatsheet, write_dataset, the function that exports data to Apache Parquet format, requires a data frame.
And that is precisely my problem, because a data frame in R is in-memory whereas, as explained above, my data live in a local SQLite database. This means I would first have to SELECT the whole data into RAM, something like
df <- sqldf("select * from mytable", dbname = ...)
And only then would I be able to call write_dataset with the resulting df data frame as its first argument, in order to export and split my data in Apache Parquet format. But this is not what I want to do: the whole point of putting the data in SQLite was to keep it out of memory (RAM), given the resource limitations (lack of memory) in our shared environment.
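To be explicit, the in-memory flow I am trying to avoid would look roughly like this (key1, key2 and the output directory are just placeholders for my real partitioning keys and path):

df <- sqldf("select * from mytable", dbname = db_file_path)
arrow::write_dataset(df,
                     path = "D:/tmp/parquet_out",
                     format = "parquet",
                     partitioning = c("key1", "key2"))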
Is there any way to convert to Apache Parquet directly from SQLite within an R program, without first putting the whole data into a data frame, or am I trying to do something that is simply not possible?
DuckDB has several great features, including the ability to both import and export CSV and parquet formats natively without affecting R memory.
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)) to 'quux3.pq' (format parquet)")
And that is all. The data is never imported into R. (Now, whether duckdb can do it itself without exhausting memory is another issue I did not validate locally ...)
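Since your goal is also to split the output into smaller files per key, note that duckdb's COPY can also write hive-partitioned parquet directly; a minimal sketch, assuming a reasonably recent duckdb version and hypothetical key columns key1/key2:

DBI::dbExecute(con, "
  copy (select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1))
  to 'parquet_out' (format parquet, partition_by (key1, key2))
")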
Caveat emptor: before you trust this blindly, however, I strongly urge you to do some validation of classes. Most of it can be done easily in a "lazy" fashion using duckdb
without having to load the whole frame into R. I encourage you to read more of its documentation for querying CSV/parquet files natively (without loading into R).
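For example, a couple of lazy spot-checks (a sketch; I did not run these against your data): inspect the column types duckdb infers, and peek at a few rows of the generated parquet file, all without materializing the full table in R:

# column names/types duckdb infers for the CSV
DBI::dbGetQuery(con, "describe select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)")
# first few rows of the parquet output, read lazily by duckdb
DBI::dbGetQuery(con, "select * from read_parquet('quux3.pq') limit 5")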
To compare the two methods (via a data.frame, which you don't want to do, and via duckdb), we'll use "RSS" (from ps::ps_memory_info()) to indicate the current R process memory usage. From ?ps::ps_memory_info:
* 'rss': "Resident Set Size", this is the non-swapped physical
  memory a process has used (bytes). On UNIX it matches "top"'s
  'RES' column (see doc). On Windows this is an alias for
  'wset' field and it matches "Memory" column of 'taskmgr.exe'.
Though an imperfect measure of the true impact to R, it does indicate a significantly smaller impact on R when using DuckDB.
Also, each method is done in a fresh instance of R --vanilla: no .Rprofile or site-init files are loaded. The code you see is the code that is executed, nothing more.
Sys.getpid()
# [1] 20860
file.info("quux.csv")["size"] / (1024^2) # MBs
# size
# quux.csv 299.3079
mem1 <- ps::ps_memory_info()["rss"]
dat <- read.csv("quux.csv")
mem2 <- ps::ps_memory_info()["rss"]
arrow::write_parquet(dat, "quux1.pq")
mem3 <- ps::ps_memory_info()["rss"]
c(mem1, mem2, mem3, diff = mem3 - mem1) / (1024^2)
# rss rss rss diff.rss
# 57.70703 1218.55859 1548.54688 1490.83984
This indicates that R is 1490 MB larger after reading in the full data. (FYI, data.table::fread instead of read.csv results in only a 408 MB memory gain, under the same austere conditions. I'm not trying to optimize this part, though. :-)
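That variant only swaps the read step, roughly:

dat <- data.table::fread("quux.csv")   # instead of read.csv
arrow::write_parquet(dat, "quux1.pq")

with the same memory-snapshot bookkeeping around it.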
(FYI, these numbers vary for me from run-to-run and are likely to be different based on other factors outside the scope of this answer. My laptop has 64GB of RAM, it might not be comparable to exactly what you see.)
Sys.getpid()
# [1] 32485
mem1 <- ps::ps_memory_info()["rss"]
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv')) to 'quux2.pq' (format parquet)")
# [1] 1000207
mem2 <- ps::ps_memory_info()["rss"]
c(mem1, mem2, diff=mem2 - mem1) / (1024^2)
# rss rss diff.rss
# 63.23828 86.35938 23.12109
This shows an increase of only 23 MB in the duckdb-based process.
file.info(list.files(pattern = "quux.*"))["size"] / (1024^2)
# size
# quux.csv 299.30786
# quux1.pq 51.69008
# quux2.pq 66.84857
The larger file is due to the differences in class noted below. My guess is that if we force some of those character columns to be logical, the file size might be reduced.
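If you want to try that, one option (untested here) is to cast the affected columns in the select that feeds the copy, for instance using duckdb's SELECT * REPLACE syntax on a couple of the all-null columns (writing to a new file, here 'quux4.pq'):

DBI::dbExecute(con, "
  copy (
    select * replace (try_cast(V5 as boolean) as V5,
                      try_cast(V8 as boolean) as V8)
    from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)
  ) to 'quux4.pq' (format parquet)
")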
A little more in-depth look at the contents:
ds1 <- arrow::open_dataset("quux1.pq")
ds2 <- arrow::open_dataset("quux2.pq")
identical(names(ds1), names(ds2))
# [1] TRUE
data.frame(
ds1 = sapply(head(ds1, 1), function(z) class(z)[1]),
ds2 = sapply(head(ds2, 1), function(z) class(z)[1])
)
# ds1 ds2
# V1 character character
# V2 integer integer
# V3 character character
# V4 integer integer
# V5 logical character
# V6 integer integer
# V7 character POSIXct
# V8 logical character
# V9 numeric numeric
# V10 numeric numeric
# V11 numeric integer
# V12 integer integer
# V13 integer integer
# V14 integer integer
# V15 numeric numeric
# V16 integer integer
# V17 integer integer
# V18 numeric numeric
# V19 numeric numeric
# V20 logical character
# V21 numeric numeric
# V22 numeric numeric
# V23 numeric numeric
# V24 integer integer
# V25 logical character
# V26 integer integer
# V27 integer integer
# V28 integer integer
# V29 integer integer
# V30 logical character
# V31 logical character
# V32 numeric numeric
# V33 logical character
# V34 logical character
# V35 logical character
# V36 logical character
# V37 logical character
# V38 logical character
# V39 character POSIXct
# V40 logical character
# V41 logical character
# V42 numeric integer
# V43 logical character
# V44 logical character
# V45 logical character
# V46 logical character
# V47 numeric numeric
# V48 logical character
# V49 logical character
# V50 logical character
# V51 logical character
# V52 logical character
# V53 logical character
# V54 logical character
# V55 logical character
# V56 logical character
# V57 logical character
Some interesting things to deduce from this:

* two of the columns (V7 and V39) are timestamps: duckdb (ds2) recognizes them as POSIXct, whereas the read.csv route (ds1) only sees character for them;
* the columns that are logical in ds1 and character in ds2 are all null (sorry, it was the data I had); the fact that they are different classes indicates that duckdb defaults to string-like nulls instead of "bit", which may or may not be a factor for you;
* only two columns came out numeric-vs-integer: V11 was truly integer, so it's fine; the second one, V42, shows that the heuristic used for differentiating between numeric and integer missed something. The first row of V42 that contained any fractional component was on row 37159.

Column V42 indicates that we need to be very cognizant of what is going into and out of that parquet generator. My guess is that the problem is in the "CSV Import" step, so looking at CSV Loading suggests the need to change SAMPLE_SIZE. While relatively inefficient, I'll use -1, indicating that it needs to look at all values in a column to determine its class. Slower, yes, but also safer.
Validation of this assumption:
> str(DBI::dbGetQuery(con, "select * from read_csv_auto('quux.csv') limit 5")[c("V11","V42")])
'data.frame': 5 obs. of 2 variables:
$ V11: int 4407 4408 4408 4407 4408
$ V42: int 26 25 23 21 3
> str(DBI::dbGetQuery(con, "select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1) limit 5")[c("V11","V42")])
'data.frame': 5 obs. of 2 variables:
$ V11: int 4407 4408 4408 4407 4408
$ V42: num 26 25 23 21 3
Indeed, V11 is still good, and V42 changes from int to num.
After rerunning with this new parameter,
DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)) to 'quux3.pq' (format parquet)")
offline validation confirmed that all values are correct.
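For reference, this kind of spot-check can also be done mostly lazily from R with arrow; a sketch (not the validation I actually ran):

library(dplyr)
ds3 <- arrow::open_dataset("quux3.pq")
ds3$schema   # confirm the column types duckdb wrote
nrow(ds3)    # row count comes from parquet metadata, no data loaded
# pull only V42 into R and confirm its fractional values survived
ds3 %>%
  select(V42) %>%
  collect() %>%
  summarise(any_fractional = any(V42 %% 1 != 0, na.rm = TRUE))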