Search code examples
sqlrpostgresqlwrds

Retrieving data from many CUSIPs using a SQL query on the WRDS MSRB dataset


I am very new to SQL so my apologies if this is an easy question, I didn't find anything while searching but I may have missed obvious search terms.

I am trying to download all transaction data for a set of municipal bonds for which I have a list of CUSIPs, currently being stored as a .txt file with one CUSIP per line. The online version of WRDS allows a user to upload such a .txt file to retrieve their data.

I would like to automate this process in R and followed the WRDS guide for setting up SQL queries in R. Ultimately I will use something along the lines of

res <- dbSendQuery(wrds, "select *
               from msrb.msrb
               where cusip IN ???")
data <- dbFetch(res, n=-1)
dbClearResult(res)
data

How do I actually get my list of CUSIPs into the query? It is too long for it to be practicable to directly list each CUSIP. Can I reference a .txt file somehow, or at least a character vector in R or something? Is there a better approach?


Solution

  • I think there are two efficient ways to programmatically do IN (...) in SQL, and one way that is popular but risky (and I generally discourage it).

    1. Using parameter binding. This is practical up to some subjective limit; there might be a real limit on how many parameters DBI allows to be bound, but I don't know it; I don't know if SQL implementations often limit the number of values you can put in a literal IN (...) statement (I just tested PG11 with 5000, no problem). At some point, it might more efficient or desirable to use option 2 below. However, if we're talking on the order of so many dozen, then try this.

      cusips <- c(...) # vector of CUSIPs
      params <- paste(paste0("$", seq_along(cusips)), collapse = ",")
      ret <- DBI::dbGetQuery(con,
        paste("select * from msrb.msrb where cusip in (", params, ")"),
        params = as.list(cusips))
      

      The use of ($1, $2, $3) is specific to postgres; other DBMSes may use different nomenclature, including (?,?,?) (sql server and others).

    2. Upload the ids into a temporary table and query against it. (This can also be partially used if you acquire the ids to use from another query, just update the inner SQL to reflect your other query.)

      cusips <- c(...) # vector of CUSIPs
      tmptbl <- paste0("tmptable_", paste(sample(9), collapse = ""))
      DBI::dbWriteTable(con, tmptbl, data.frame(cusip = cusips))
      DBI::dbGetQuery(con,
        paste("select * from msrb.msrb where cusip in",
              "(select cusip from", tmptbl, ")"))
      

      or join against the temp table, with

      DBI::dbGetQuery(con,
        paste("select msrb.* from ", tmptbl, "t",
              "left join msrb on t.cusip = msrb.cusip"))
      
    3. In general, I am a staunch advocate for using parameter binding, as it will side-step any form of SQL injection, whether malicious or accidental. If you are in a rush, however, you can form the IN (...) yourself. You can use glue::glue_sql to make sure the correct quotes (for your specific DBMS) are always used; if not, it is often safe to use single quotes.

      cusips <- c(...) # vector of CUSIPs
      params <- paste("(", paste(sQuote(cusips), collapse = ","), ")")
      # or
      params <- glue::glue_sql("({cusips*})", .con = con)
      DBI::dbGetQuery(con, paste("select * from msrb.msrb where cusip in", params))
      

      Note that glue::glue_sql provides the * notation. From ?glue::glue_sql:

      If you place a '*' at the end of a glue expression the values will be collapsed with commas. This is useful for the SQL IN Operator for instance.

    For all three methods, I used the more-direct DBI::dbGetQuery, but you can still use the DBI::dbSendQuery/DBI::dbFetch two-step if you prefer.

    Depending on the size of your msrb table as well as its indices, these queries might not hit all of the optimizations. If that's the case, consider adding to the query based on advice from your DBA.