Search code examples
rsqlitesqldfread.csv

read.csv.sql inserts all rows despite the WHERE clause in its SQL clause


I use the following (pseudo) code to import huge CSV files in R when only a rather small portion of the data in the file has to be treated/imported and other rows should simply be ignored. I don't put data in-memory but in a SQLite database which is created by the program.

library(sqldf)
suppressWarnings(read.csv.sql(
    file = input_file_path,
    sql = "
       create table mytable as 
       select
           . . .
       from
           file
       where
           . . .
      ",
    header = FALSE,
    sep = "|",
    eol = "\n",
    `field.types` = list( . . . )
    dbname = sqlite_db_path,
    drv = "SQLite"
))

If I understand the documentation of read.csv.sql correctly, the WHERE clause in the CREATE TABLE statement above, guarantees that only rows satisfying the condition of the WHERE clause will be inserted in the created table mytable. However, last night during a test I found somewhat a strange behaviour in my program. I had to import a big CSV file (more than 100 000 000 rows), but only a set of 10 000 rows was the target and all other rows had to be ignored. The above program did the job and at the end, the created table mytable indicated in the code above, had only 10 000 rows as expected which proves that the condition in the WHERE clause was taken into account. However, when I checked the size of the created SQLite database, it was abnormally huge (more than 25 GB). That cannot possibly be the size of 10 000 rows.

Therefore, I'm confused. What happened during the import process? Why the SQLite database file became so huge depsite the fact that only 10 000 rows were inserted in mytable and everything else was ignored? Is this the normal behaviour or it's me who is not using correctly read.csV.sql?


Solution

  • Thanks to @G. Grothendieck for his clarification :

    It first uses dbWriteTable to write the file to the database. Then it operates on that. You co consider using grep/findstr/xsv to extract out the rows of interest piping that direclty into R. read.table(pipe("grep ..."), sep = ",") or in Windows findstr or the xsv or other csv utility and use that instead of grep. –