I use the following (pseudo) code to import huge CSV files in R when only a rather small portion of the data in the file has to be processed/imported and all other rows should simply be ignored. I don't load the data into memory but into a SQLite database, which is created by the program.
library(sqldf)
suppressWarnings(read.csv.sql(
  file = input_file_path,
  sql = "
    create table mytable as
    select
      . . .
    from
      file
    where
      . . .
  ",
  header = FALSE,
  sep = "|",
  eol = "\n",
  field.types = list( . . . ),
  dbname = sqlite_db_path,
  drv = "SQLite"
))
If I understand the documentation of read.csv.sql correctly, the WHERE clause in the CREATE TABLE statement above guarantees that only rows satisfying the condition of the WHERE clause will be inserted into the created table mytable. However, last night during a test I noticed a somewhat strange behaviour in my program. I had to import a big CSV file (more than 100 000 000 rows), but only a set of 10 000 rows was the target and all other rows had to be ignored. The above program did the job, and at the end the created table mytable had only 10 000 rows as expected, which proves that the condition in the WHERE clause was taken into account. However, when I checked the size of the created SQLite database, it was abnormally huge (more than 25 GB). That cannot possibly be the size of 10 000 rows.
Therefore, I'm confused. What happened during the import process? Why did the SQLite database file become so huge despite the fact that only 10 000 rows were inserted into mytable and everything else was ignored? Is this the normal behaviour, or am I not using read.csv.sql correctly?
Thanks to @G. Grothendieck for his clarification:
It first uses dbWriteTable to write the whole file to the database. Then it operates on that. You could consider using grep/findstr/xsv to extract the rows of interest and pipe them directly into R, e.g. read.table(pipe("grep ..."), sep = ","); on Windows, use findstr, xsv or another CSV utility instead of grep.
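For reference, a minimal sketch of that pipe-based alternative on a Unix-like system might look like the code below. The pattern 'TARGET' is a placeholder for whatever actually identifies the rows of interest, and writing the filtered result back to SQLite is optional; adjust to your real filter and schema.
library(DBI)
library(RSQLite)

# Let grep filter the file before it ever reaches R, so only matching
# rows are read; 'TARGET' is a hypothetical pattern, not my real condition.
filtered <- read.table(
  pipe(paste("grep 'TARGET'", shQuote(input_file_path))),
  sep = "|",
  header = FALSE
)

# If the result still needs to live in SQLite, write only the filtered
# rows, which keeps the database file small.
con <- dbConnect(SQLite(), sqlite_db_path)
dbWriteTable(con, "mytable", filtered)
dbDisconnect(con)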