Export a SQLite table to Apache Parquet without creating a dataframe


I have multiple huge CSV files that I have to export in Apache Parquet format and split into smaller files based on multiple criteria/keys (= column values). As I understand it, arrow is the R package that allows working with Apache Parquet files.

I work in a shared lab environment and, given the limited RAM (compared to the number of users working simultaneously in this same environment), we are advised to put our data in local SQLite databases rather than importing it into memory (RAM) as dataframes.

The following pseudo-code shows how I import my CSV files into my local SQLite database. In the code below I use the sqldf and tidyverse packages.

library(sqldf)      # provides sqldf() and read.csv.sql()
library(tidyverse)  # provides str_c()

input_file_path <- "D:/tmp/mydata.csv"
db_file_path <- "D:/tmp/db_tmp_sqlite.db"

## start from a fresh database file and attach it
unlink(db_file_path)
sqldf(str_c("attach '", db_file_path, "' as new"))

## stream the CSV into a SQLite table without materializing it in R
read.csv.sql(
    file = input_file_path,
    sql = "
        create table mytable as
        select
            . . .
        from
            file
    ",
    `field.types` = list(
      . . .
    ),
    ## CSV format and target database options
    header = TRUE,
    sep = ",",
    eol = "\n",
    dbname = db_file_path,
    drv = "SQLite"
)

This works as expected: my table is created and I can run all the required SQL queries, in particular adding supplementary variables (columns) that will later be used as keys for exporting my table to Apache Parquet format. However, based on the Apache Arrow for R Cheatsheet, write_dataset, the function that exports data in Apache Parquet format, requires a dataframe.

And that is precisely my problem, because a dataframe in R is held in memory whereas, as I explained earlier, my data sits in a local SQLite database. This means I would first have to SELECT the whole table into RAM, something like

df <- sqldf("select * from mytable", dbname = ...)

And only then would I be able to pass the resulting df dataframe to write_dataset as its first argument in order to export and split my data in Apache Parquet format. But that is exactly what I wanted to avoid: the whole point of putting the data in SQLite rather than in memory (RAM) was the resource limitation (lack of memory) in our shared environment.
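
To make it concrete, the in-memory route I am trying to avoid would look roughly like this (the columns "key1" and "key2" here stand in for my real split keys):

## pull the whole table into RAM, then export it split by key columns
df <- sqldf("select * from mytable", dbname = db_file_path)
arrow::write_dataset(df, "D:/tmp/parquet_out",
                     format = "parquet",
                     partitioning = c("key1", "key2"))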

Is there any way to convert to Apache Parquet directly from SQLite within an R program, without first putting the whole data into a dataframe, or am I trying to do something that is simply not possible?


Solution

  • DuckDB has several great features, including the ability to both import and export CSV and parquet formats natively without affecting R memory.

    TL;DR

    con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
    DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)) to 'quux3.pq' (format parquet)")
    

    And that is all. The data is never imported into R. (Now, whether duckdb can do it itself without exhausting memory is another issue I did not validate locally ...)

    Caveat emptor: before you trust this blindly, however, I strongly urge you to do some validation of classes. Most of it can be done easily in a "lazy" fashion using duckdb without having to load the whole frame into R. I encourage you to read more of its documentation for querying CSV/parquet files natively (without loading into R).
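
    As a sketch of what that lazy validation can look like (still through duckdb, never materializing the full table in R; read_parquet here is DuckDB's parquet-reading table function, and 'quux3.pq' is the file written above):

    # peek at a few rows and the inferred column classes, straight from the files
    str(DBI::dbGetQuery(con, "select * from read_csv_auto('quux.csv') limit 5"))
    str(DBI::dbGetQuery(con, "select * from read_parquet('quux3.pq') limit 5"))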

    Methodology

    To make a comparison of the two methods (via a data.frame which you don't want to do, and via duckdb), we'll use "RSS" (from ps::ps_memory_info()) to indicate the current R process memory usage. From ?ps::ps_memory_info:

            * 'rss': "Resident Set Size", this is the non-swapped physical
              memory a process has used (bytes). On UNIX it matches "top"'s
              'RES' column (see doc). On Windows this is an alias for
              'wset' field and it matches "Memory" column of 'taskmgr.exe'.
    

    Though an imperfect measure of the true impact on R, it does indicate a significantly smaller footprint for the R process when using DuckDB.

    Also, each method is done in a fresh instance of R --vanilla. No .Rprofile or site-init files are loaded. The code you see is the code that is executed, nothing more.

    In R via data.frame

    Sys.getpid()
    # [1] 20860
    
    file.info("quux.csv")["size"] / (1024^2) # MBs
    #              size
    # quux.csv 299.3079
    mem1 <- ps::ps_memory_info()["rss"]
    dat <- read.csv("quux.csv")
    mem2 <- ps::ps_memory_info()["rss"]
    arrow::write_parquet(dat, "quux1.pq")
    mem3 <- ps::ps_memory_info()["rss"]
    c(mem1, mem2, mem3, diff = mem3 - mem1) / (1024^2)
    #        rss        rss        rss   diff.rss 
    #   57.70703 1218.55859 1548.54688 1490.83984 
    

    This indicates R is 1490MB larger after reading in the full data. (FYI, data.table::fread instead of read.csv results in only 408MB of memory gain, same austere conditions. I'm not trying to optimize this part, though :-)

    (FYI, these numbers vary for me from run-to-run and are likely to be different based on other factors outside the scope of this answer. My laptop has 64GB of RAM, it might not be comparable to exactly what you see.)
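
    For reference, the data.table::fread variant mentioned above is only a different reader in the same flow, roughly (a sketch, not benchmarked here beyond the RSS figure already quoted):

    dat <- data.table::fread("quux.csv")   # drop-in replacement for read.csv
    arrow::write_parquet(dat, "quux1.pq")  # same parquet write as before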

    DuckDB, read from CSV, write to parquet

    Sys.getpid()
    # [1] 32485
    
    mem1 <- ps::ps_memory_info()["rss"]
    con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
    DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv')) to 'quux2.pq' (format parquet)")
    # [1] 1000207
    mem2 <- ps::ps_memory_info()["rss"]
    c(mem1, mem2, diff=mem2 - mem1) / (1024^2)
    #      rss      rss diff.rss 
    # 63.23828 86.35938 23.12109 
    

    showing only 23MB in this process.

    Comparing the resulting files.

    file.info(list.files(pattern = "quux.*"))["size"] /  (1024^2)
    #               size
    # quux.csv 299.30786
    # quux1.pq  51.69008
    # quux2.pq  66.84857
    

    The larger file is due to the differences in class noted below. My guess is that if we force some of the character columns to be logical, then the file size might be reduced.
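
    If that ever matters, one way to do it (a sketch I have not benchmarked; V5 is one of the all-NULL columns in the table below, and 'quux4.pq' is a made-up output name) is to cast inside the copy, since DuckDB's select * replace (...) lets you override individual columns:

    DBI::dbExecute(con, "
      copy (
        select * replace (cast(V5 as boolean) as V5)
        from read_csv_auto('quux.csv')
      ) to 'quux4.pq' (format parquet)")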

    A little more in-depth look at the contents:

    ds1 <- arrow::open_dataset("quux1.pq")
    ds2 <- arrow::open_dataset("quux2.pq")
    identical(names(ds1), names(ds2))
    # [1] TRUE
    
    data.frame(
      ds1 = sapply(head(ds1, 1), function(z) class(z)[1]),
      ds2 = sapply(head(ds2, 1), function(z) class(z)[1])
    )
    #           ds1       ds2
    # V1  character character
    # V2    integer   integer
    # V3  character character
    # V4    integer   integer
    # V5    logical character
    # V6    integer   integer
    # V7  character   POSIXct
    # V8    logical character
    # V9    numeric   numeric
    # V10   numeric   numeric
    # V11   numeric   integer
    # V12   integer   integer
    # V13   integer   integer
    # V14   integer   integer
    # V15   numeric   numeric
    # V16   integer   integer
    # V17   integer   integer
    # V18   numeric   numeric
    # V19   numeric   numeric
    # V20   logical character
    # V21   numeric   numeric
    # V22   numeric   numeric
    # V23   numeric   numeric
    # V24   integer   integer
    # V25   logical character
    # V26   integer   integer
    # V27   integer   integer
    # V28   integer   integer
    # V29   integer   integer
    # V30   logical character
    # V31   logical character
    # V32   numeric   numeric
    # V33   logical character
    # V34   logical character
    # V35   logical character
    # V36   logical character
    # V37   logical character
    # V38   logical character
    # V39 character   POSIXct
    # V40   logical character
    # V41   logical character
    # V42   numeric   integer
    # V43   logical character
    # V44   logical character
    # V45   logical character
    # V46   logical character
    # V47   numeric   numeric
    # V48   logical character
    # V49   logical character
    # V50   logical character
    # V51   logical character
    # V52   logical character
    # V53   logical character
    # V54   logical character
    # V55   logical character
    # V56   logical character
    # V57   logical character
    

    Some interesting things to deduce from this:

    • two fields are timestamps, and the duckdb method correctly identified them, parsed them, and stored as numeric timestamps; since I didn't tell R the column classes explicitly, it defaulted to character for them;
    • all of the columns that are logical in ds1 and character in ds2 are all-null (sorry, that is the data I had); the fact that they get different classes indicates that duckdb defaults to string-like nulls instead of "bit", which may or may not matter for you;
    • only two columns differ as numeric-vs-integer: V11 is truly integer, so it is fine; the second one, V42, shows that the heuristic used to differentiate between numeric and integer missed something. The first row of V42 that contains any fractional component is row 37159 (a quick way to verify this is sketched just after this list).
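
    That last point can be verified with the in-memory dat from the data.frame method (a sketch; the whole point elsewhere is, of course, to avoid having dat at all):

    # index of the first V42 value with a fractional part
    which(dat$V42 %% 1 != 0)[1]
    # [1] 37159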

    Fixing data discrepancies

    Column V42 indicates that we need to be very cognizant of what is going in and out of that parquet generator. My guess is that the problem is in the "CSV Import" step, and looking at DuckDB's CSV Loading documentation suggests changing SAMPLE_SIZE. While relatively inefficient, I'll use -1, meaning it looks at all values in a column to determine its class. Slower, yes, but also safer.

    Validation of this assumption:

    > str(DBI::dbGetQuery(con, "select * from read_csv_auto('quux.csv') limit 5")[c("V11","V42")])
    'data.frame':   5 obs. of  2 variables:
     $ V11: int  4407 4408 4408 4407 4408
     $ V42: int  26 25 23 21 3
    > str(DBI::dbGetQuery(con, "select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1) limit 5")[c("V11","V42")])
    'data.frame':   5 obs. of  2 variables:
     $ V11: int  4407 4408 4408 4407 4408
     $ V42: num  26 25 23 21 3
    

    Indeed, V11 is still good, and V42 changes from int to num.

    After rerunning with this new parameter,

    DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)) to 'quux3.pq' (format parquet)")
    

    offline validation confirmed that all values are correct.
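
    (For what it's worth, here is a sketch of the kind of spot-check that can be done, reusing the lazy arrow pattern from above; this is not my exact validation script:)

    ds3 <- arrow::open_dataset("quux3.pq")
    # compare inferred classes of the corrected file against the data.frame-based one
    data.frame(
      ds1 = sapply(head(ds1, 1), function(z) class(z)[1]),
      ds3 = sapply(head(ds3, 1), function(z) class(z)[1])
    )
    # V42 should now come back as numeric, matching ds1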