Tags: sql, r, optimization, teradata, pseudocode

Optimizing R script for querying Teradata data dynamically and handling weekends


I currently have a process in place but need to reformulate it to be more efficient. I will describe the new process below, and I would like to know whether it is optimal and follows good coding practice.

In essence, I have one data source, Teradata. I am pulling data from Teradata using SQL, and these SQL queries are embedded in R. This R script is to be run daily on weekdays, but not on weekends. The script should pull the last 14 days' worth of data; however, since it is run daily, after the first data pull I only need to pull the newest day's worth of data from that point on.

There are 7 separate tables I am using from Teradata, each of which is also loaded daily. I should have some sort of trigger that checks each table to see whether it has been loaded with new data and, if it has, pulls the new data. If not, the script should sit idle for 5 minutes and then check again.

Furthermore, on Tuesday, Wednesday, Thursday, and Friday, I only need to query the latest day's worth of new data. However, on Monday, since the process is not run over the weekend, I will need the latest 3 days' worth of data (Sunday, Saturday, and Friday).
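
To illustrate, the lookback window could be set at the top of the script with something like the sketch below; days_back is just an illustrative name, and base R's weekdays() returns locale-dependent day names ("Monday" in English locales).

    ## pull 3 days back on Mondays (to cover the weekend), otherwise just 1 day
    days_back <- if (weekdays(Sys.Date()) == "Monday") 3L else 1L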

Here is my very rough outline of the pseudocode:

1. Define functions.


2. Pull in the base data for the last 14 days:

     A <- sql_query_last_14_days
     ## this assignment only needs to be run once, so I will comment it out after the
     ## first go-around

     B <- saved RData file
     ## this will not be used in the first run of the process, but will be used for the 
     ## second run and every subsequent run

3. Loop to check each table for loading and query the data (an R sketch of this loop follows the outline):

    pending <- tables_list
    while pending is not empty:
        for each table in pending:
            if table_is_loaded(table):
                if is_monday():
                    query_and_append_data_for_monday(table)
                else:
                    query_and_append_data(table)
                remove table from pending
        if pending is not empty:
            wait_for_table_loading()    ## sit idle for 5 minutes, then re-check

4. Data Manipulation

5. Export the updated dataset to CSV.

6. Save the dataset to an .RData file.
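
For reference, a rough R version of the loop in step 3 might look like the sketch below. The helpers table_is_loaded(), is_monday(), query_and_append_data(), and query_and_append_data_for_monday() are placeholders I would still need to write around the actual Teradata SQL, and tables_list is assumed to be a character vector of the 7 table names.

    pending <- tables_list                   ## names of the 7 Teradata tables
    while (length(pending) > 0) {
      for (tbl in pending) {
        if (table_is_loaded(tbl)) {
          if (is_monday()) {
            query_and_append_data_for_monday(tbl)
          } else {
            query_and_append_data(tbl)
          }
          pending <- setdiff(pending, tbl)   ## this table is done for today
        }
      }
      if (length(pending) > 0) {
        Sys.sleep(5 * 60)                    ## sit idle 5 minutes, then re-check
      }
    }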



Solution

1. Initialize the process with now <- Sys.time() and prevtime <- now - 14*86400 (two weeks ago), and query all data that is timestamped between those two times. (Realistically, if you run this as a script, only milliseconds will pass between when you define now and when the query hits, so it should effectively be "all newer data".) Once you get this data, save it with saveRDS(list(dat = querydat, time = now), format(now, format="%Y%m%d.rds")) or similar; see the fuller sketch at the end of this answer.

2. When you start up the next day, pull the previous day's .rds file (e.g., yest <- readRDS("20240402.rds")), use prevtime <- yest$time and now <- Sys.time(), and run your query again. When this happens on a Monday, it will naturally query the whole weekend, since the last saved .rds file should be from the previous Friday. (Also, if you happen to skip a day due to a holiday, vacation, an "oops", or something else, the next query will automatically include everything since where you left off.)

3. With that data, process it however you need to. Depending on the size of the data and the complexity of your computations on it, I generally prefer to save the "raw" (immediately post-query) data; I do this because if anything in my processing changes, I don't need to re-download the raw data to be able to start the computation over. If the computation is "expensive" and the size is not prohibitive, consider storing both the raw data and the post-calc data in the .rds file (or a separate .rds file) to reduce the required effort "tomorrow".

File format is mostly preference. I prefer .rds over .rda since I don't like load(.) creating multiple objects in my environment, and I really like keeping the raw data and its relevant timestamp in the same place. If you need to export as CSV (e.g., for some external non-R process), that's fine; otherwise I tend to use CSV only when I'm in a hurry, the data is not large, and I expect to need to look at the raw data manually and/or with Excel. If I am "deploying" data for some web service/endpoint/shiny app, I often use .parquet files for the lazy access to tabular data, compression, efficiency, and near-.rds speed; parquet is much faster than CSV and much more "large-data"-friendly.
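
For concreteness, a minimal end-to-end sketch of steps 1-3 might look like the following. It assumes a DBI connection con to Teradata is already open (e.g., via the odbc package); the table name mydata, its timestamp column load_ts, and the single-table query are purely illustrative stand-ins for your seven tables, and finding the most recent saved file by filename pattern is just one way to compute prevtime.

    library(DBI)

    now <- Sys.time()

    ## most recent saved run, if any (files are named like 20240402.rds)
    prev_file <- tail(sort(list.files(pattern = "^\\d{8}\\.rds$")), 1)

    if (length(prev_file) == 0) {
      prevtime <- now - 14 * 86400           ## first run: go back 14 days
    } else {
      prevtime <- readRDS(prev_file)$time    ## later runs: pick up where we left off
    }

    ## query everything timestamped after prevtime and up to now
    querydat <- dbGetQuery(
      con,
      "SELECT * FROM mydata WHERE load_ts > ? AND load_ts <= ?",
      params = list(prevtime, now)
    )

    ## ... data manipulation on querydat goes here ...

    ## keep the raw data and its timestamp together for tomorrow's run
    saveRDS(list(dat = querydat, time = now), format(now, format = "%Y%m%d.rds"))

    ## optional CSV export (step 5 of the question's outline)
    write.csv(querydat, format(now, format = "%Y%m%d.csv"), row.names = FALSE)

Because prevtime always comes from the previously saved file, a Monday run (or a run after a skipped day) automatically covers the whole gap without any weekday-specific logic.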