Tags: r, sqlite, dbplyr

How to create non-clustered indexes for querying and collecting data from an SQLite DB in R for plotting?


I have a .csv file that contains 105M rows and 30 columns that I would like to query for plotting in an R shiny app.

It contains alphanumeric data that looks like this:

        # Example data
        df <- data.frame(percent = as.numeric(rep(c("50", "80"), each = 5e2)),
                         maskProportion = as.numeric(rep(c("50", "80"), each = 5e2)),
                         dose = runif(1e3),
                         origin = as.factor(rep(c("ABC", "DEF"), each = 5e2)),
                         destination = as.factor(rep(c("XYZ", "GHI"), each = 5e2))
                         )

        write.csv(df, "PassengerData.csv", row.names = FALSE)

In the terminal, I have ingested it into an SQLite database as follows:

$ sqlite3 -csv PassengerData.sqlite3 '.import PassengerData.csv df'

which is from:

Creating an SQLite DB in R from an CSV file: why is the DB file 0KB and contains no tables?

So far so good.

The problem I have is query speed from R, so I tried indexing the DB back in the terminal.

In sqlite3, I tried creating indexes on percent, maskProportion, origin, and destination, following https://data.library.virginia.edu/creating-a-sqlite-database-for-use-with-r/:

$ sqlite3 PassengerData.sqlite3 'CREATE INDEX "percent" ON df("percent");'
$ sqlite3 PassengerData.sqlite3 'CREATE INDEX "origin" ON df("origin");'
$ sqlite3 PassengerData.sqlite3 'CREATE INDEX "destination" ON df("destination");'
$ sqlite3 PassengerData.sqlite3 'CREATE INDEX "maskProportion" ON df("maskProportion");'

I run out of disk space because the DB grows in size every time I create an index; for example, after running the first command alone the file is 20 GB. How can I avoid this?


Solution

  • I assume the concern is that running collect() to transfer data from SQL to R is too slow for your app. It is not clear how / whether you are processing the data in SQL before passing to R.

    Several things to consider:

    • Indexes are not copied from SQL to R. SQL works with data off disk, so knowing where to look up specific parts of your data results in time savings. R works on data in memory, so indexes are not required.
    • collect transfers data from a remote table (in this case SQLite) into R memory. If your goal is only to get the data into R, you could read the csv directly into R instead of writing it to SQL and then reading it from SQL into R (see the connection sketch after this list).
    • SQL is a better choice for data crunching / preparation of large datasets, and R is a better choice for detailed analysis and visualisation. But if both R and SQL are running on the same machine then both are limited by the same CPU. This is not a concern if SQL and R are running on separate hardware.
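
    As a minimal sketch of the setup assumed in the examples below (the connection object con and the name remote_sql_df are illustrative; the table name df comes from the .import command in the question):

    library(DBI)
    library(dplyr)
    
    # open the existing SQLite database file
    con <- DBI::dbConnect(RSQLite::SQLite(), "PassengerData.sqlite3")
    
    # lazy reference to the imported table; nothing is read into R yet.
    # dbplyr translates dplyr verbs applied to this object into SQL.
    remote_sql_df <- tbl(con, "df")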

    Some things you can do to improve performance:

    (1) Only read the data you need from SQL into R. Prepare the data in SQL first. For example, contrast the following:

    # collect last
    local_r_df = remote_sql_df %>%
      group_by(origin) %>%
      summarise(number = n()) %>%
      collect()
    
    # collect first
    local_r_df = remote_sql_df %>%
      collect() %>%
      group_by(origin) %>%
      summarise(number = n())
    

    Both of these will produce the same output. However, in the first example, the summary takes place in SQL and only the final result is copied to R; while in the second example, the entire table is copied to R where it is then summarized. Collect last will likely have better performance than collect first because it transfers only a small amount of data between SQL and R.
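
    The same "collect last" principle applies to filtering. If the app only plots one subset at a time, push the filter (and column selection) into SQL so that SQLite, helped by any indexes on the filtered columns, does the subsetting and only the matching rows are transferred. A hedged sketch, reusing the illustrative remote_sql_df reference and the column names from the example data:

    # filter and select in SQL, then transfer only the rows needed for the plot
    plot_df = remote_sql_df %>%
      filter(origin == "ABC", destination == "XYZ") %>%
      select(dose, percent, maskProportion) %>%
      collect()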

    (2) Preprocess the data for your app. If your app will only examine the data along a limited number of dimensions, then the data could be preprocessed / pre-summarized.

    For example, suppose users can pick at most two dimensions and receive a cross-tab; then you could calculate all the two-way cross-tabs in advance and save them. This is likely to be much smaller than the entire database. At runtime, your app loads the prepared summaries and shows the user whichever summary they request. This will likely be much faster.
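
    A hedged sketch of that preprocessing step, again reusing the illustrative remote_sql_df reference (the dimension pair and output file name are placeholders):

    # pre-compute one two-way cross-tab in SQL and save the small result locally
    crosstab_origin_destination = remote_sql_df %>%
      count(origin, destination) %>%
      collect()
    
    saveRDS(crosstab_origin_destination, "crosstab_origin_destination.rds")
    
    # at app start-up, load the prepared summary instead of querying the full table
    # crosstab_origin_destination = readRDS("crosstab_origin_destination.rds")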