Search code examples
rsqlitersqlite

RSQLite: How to increase number of columns and parameters?


I often work with big CSV files (>50GB) with >10,000 columns. I'm thinking of reading the information into RSQLite so that I can do easy querying and subsetting.

ISSUE

SQLite has a limit of 2000 fields and 999 host parameters in a single SQL (see SQLite website). According to this, there is a way to change these parameters for SQLite (i.e. by re-compiling SQLite from source).

QUESTION

How do I make the above changes if I'm using RSQLite package? The funny thing is that according to the package's changelog, the default values have already been increased to 30,000 and 40,000 respectively as early as ver 0.11.0 back in 2011-12-01. But when I test with dataframes with 10,000 columns, RSQLite v2.1.1 still threw an error.

What am I missing here? If I need to re-compile SQLite to change the values, how do I get RSQLite to use that new version?

Thanks in advance!


Solution

  • I have reduced the limits on the number of columns to the defaults in RSQLite 1.1:

    Compilation limits SQLITE_MAX_VARIABLE_NUMBER and SQLITE_MAX_COLUMN have been reset to the defaults. The documentation suggests setting to such high values is a bad idea.

    SQLite stores the data on disk in rows. Querying a single column will require reading the entire file, or a substantial subset thereof.

    Have you considered DuckDB? It's a new

    embedded database designed to execute analytical SQL queries fast while embedded in another process. It is designed to be easy to install and easy to use. ... DuckDB has bindings for C/C++, Python and R.

    (Emphasis mine.)

    I'm told that it organizes data internally in combined row-column order: querying entire columns will only require reading a tiny fraction of the file, but returning entire rows (after filtering) is still fast. You can access a DuckDB database through DBI and dbplyr.

    CAVEAT: I'm not aware if there is a similar limitation on the number of columns, but I would be greatly surprised if the limit is much smaller than 2³¹. Let me know if it works for you.