Tags: sql, r, production, dbplyr

R in production - working down an SQL database in batched groups until database is exhausted?


Background

I am having a problem with batch processing of an R script (approx. 1000 lines) which requires calls to an SQL backend to pick up data. The data is pulled into R from SQL via odbc, and I do have write privileges to the backend. I have stored as much as possible in the backend to reduce local memory usage.

However, an issue is caused by a few memory-intensive processes, such as a huge number of correlations and imputations for time-series structures. When the script is applied to the entire dataset, this takes up a lot of local memory, and that is without counting the memory needed just to pull the data from the database in the first place.

Script and data structures

My script is designed so that batches of 20 'item groups', with up to 100 items in each group, are used to get the time-series data for each item in the group. This data is stored as a nested dataframe relative to the 'item group' (see image below), and is used throughout the script to facilitate parallel processing on the selected batch.
[Image: the nested dataframe structure, one row per 'item group']
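
For illustration, a rough sketch of what this nested structure looks like (not my actual code; batch_data and item_group are placeholder names):

    library(dplyr)
    library(tidyr)

    # one row per item group, with that group's time-series rows in a list column
    batch_nested = batch_data %>%   # batch_data: long-format table for the 20 item groups in a batch
      group_by(item_group) %>%
      nest()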

Obviously, if I expand the nested table to allow 2000+ 'item groups', I will crash my computer just loading the data, before the memory-intensive processes are even applied. So I thought the best approach may be to create a grouping variable in the backend table (numbered 1 to 200) for each chunk/batch I want to call into the R script.
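
For example (a sketch of the idea only; the table name item_timeseries and key column item_group_id are placeholders, and the exact SQL syntax varies by backend):

    library(DBI)

    # add a batch_group column numbered 1 to 200 on the backend table
    dbExecute(db_con, "ALTER TABLE item_timeseries ADD batch_group INT")
    dbExecute(db_con, "UPDATE item_timeseries
                       SET batch_group = (item_group_id % 200) + 1")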

The Question

I can also use the backend to work down the chunking/batch group variable, removing groups as they are processed. My problem now is how to make the script run until all the chunks have been processed, either directly in the R script or from the shell. What is the best approach for this? Are there any examples of working through a database like this?
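
Roughly what I have in mind (a sketch only; batch_queue and batch_group are placeholder names):

    library(DBI)

    # process one batch at a time and delete it when done,
    # stopping once the backend table is exhausted
    repeat {
      next_batch = dbGetQuery(db_con,
        "SELECT MIN(batch_group) AS batch_group FROM batch_queue")$batch_group

      if (is.na(next_batch)) break   # nothing left to process

      batch_data = dbGetQuery(db_con, paste(
        "SELECT * FROM batch_queue WHERE batch_group =", next_batch))

      # ... memory-intensive processing of batch_data ...

      dbExecute(db_con, paste(
        "DELETE FROM batch_queue WHERE batch_group =", next_batch))
    }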

I appreciate that this is more of an approach question than a coding one, but I would be happy with any coding examples that may be useful, as well as approach advice. I am new to productionising R code, so anything would be helpful.

Thank you for reading.

FYI I do not have the luxury of using cloud computing at this time.


Solution

  • For iterating over large datasets, I use the modulus of a numeric column.

    For example, the following code will subset to (approximately) one tenth of your dataset:

    library(dplyr)
    library(dbplyr)   # translates the dplyr pipeline to SQL run on the backend

    large_remote_table = tbl(db_con, ...)

    # keep only the rows whose numeric_column is exactly divisible by 10
    subset_remote_table = large_remote_table %>%
      filter(numeric_column %% 10 == 0)

    What should you use as a numeric column? If your dataset already has some kind of numeric identifier, then that is a good choice. Otherwise, I recommend you modify your input dataset to include a row number column. Many SQL flavors have a built-in row number function.
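
    As a sketch (assuming dbplyr, an existing ordering column here called item_id, and an illustrative name for the materialised table):

    library(dplyr)
    library(dbplyr)

    # materialise a row_number column on the backend;
    # row_number() translates to SQL's ROW_NUMBER() window function
    table_with_row_number = tbl(db_con, ...) %>%
      window_order(item_id) %>%
      mutate(row_number = row_number()) %>%
      compute(name = "large_table_with_row_number", temporary = FALSE)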

    We can then iterate over all the subsets using a for loop:

    number_subsets = 2000

    large_remote_table = tbl(db_con, ...)

    output_list = list()

    for(ii in 1:number_subsets){
      # the modulus takes values 0 to (number_subsets - 1), hence the ii - 1
      this_subset = large_remote_table %>%
        filter(row_number %% number_subsets == ii - 1)

      # further processing of subset
      # (collect() the subset if it needs to be pulled into local memory)

      # store result; double brackets so each element holds the whole result
      output_list[[ii]] = results_from_processing_of_subset
    }
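
    If each iteration produces a data frame (an assumption; the processing step above is left unspecified), the stored results can then be combined at the end, for example:

    combined_results = dplyr::bind_rows(output_list)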