
req_cache with req_perform_parallel in httr2


I'd like to cache the results of a large API pull that is broken up into a list of smaller pulls to stay within the API's limits. Using httr2, I can set up the requests and the external cache directories, and the pull runs fine, but nothing gets saved in those directories for later use.

Example code with a smaller request subset (publicly available at NIH as per https://icite.od.nih.gov/api):

library(httr2) # api with cache capability

# list of several URL requests
citeLst <- list()
citeLst$one <- paste(17038628,17320734,16677657, sep=",") # string of PMIDs
citeLst$two <- paste(17380955,17299329,17297311, sep=",")
citeLst$three <- paste(17280498,17141525,17262759, sep=",")
citeLst <- lapply(citeLst, function(x) paste0("https://icite.od.nih.gov/api/pubs?pmids=", x))

reqLst <- lapply(citeLst, request) # set up list of requests
# add cache path for each request in the list - produces an external directory for each citeLst element
cacheLst <- local({
  dirVec <- as.character(seq_along(reqLst))  # one sub-directory per request
  lst <- mapply(function(req, dir) req_cache(req, path=file.path("cache","citePull",dir)),
                reqLst, dirVec, SIMPLIFY=FALSE)
  return(lst)
})
respLst <- req_perform_parallel(cacheLst) # pulls data fine, but nothing in the cache

req_perform_parallel() is designed to work with a list of requests, but req_cache() only accepts a single path as a character string. That's why I used mapply() to create a separate cache directory for each element of citeLst.

The example at https://github.com/r-lib/httr2/issues/447, which shows that req_cache() does work with req_perform_parallel(), only uses a single URL in its request.

While the iCiteR package is available for pulling these data, I don't see anything in it about caching.


Solution

  • req_cache() relies on co-operation from the server. From ?req_cache:

    req_cache() caches responses to GET requests that have status code 200 and at least one of the standard caching headers (e.g. Expires, Etag, Last-Modified, Cache-Control)

    The API server at https://icite.od.nih.gov/api does not send any of these caching headers, so no caching is performed. You'd need to roll your own caching solution in this case.
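
    You can confirm this yourself by performing a single request and checking the response headers. A quick sketch, using the same iCite endpoint as above (resp_header_exists() looks up headers case-insensitively):

    library(httr2)
    
    resp <- request("https://icite.od.nih.gov/api/pubs?pmids=17038628") |>
      req_perform()
    
    # All FALSE: the server sends none of the standard caching headers,
    # so req_cache() has nothing to work with.
    sapply(c("Expires", "Etag", "Last-Modified", "Cache-Control"),
           function(h) resp_header_exists(resp, h))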

    If you simply want to persist responses without trying to determine if they should be re-fetched, you could do something along these lines:

    library(httr2)
    
    reqs_perform_parallel_with_custom_cache <- function(reqs, cache) {
      # Determine cache key for each request.
      keys <- lapply(reqs, function(req) digest::digest(req$url))
      
      # Fetch corresponding responses from cache.
      resps <- lapply(keys, function(key) cache$get(key, NULL))
      
      # Determine which requests have no cached response (cache miss).
      is_stale <- vapply(resps, is.null, logical(1))
      
      # Perform requests that had stale responses.
      resps[is_stale] <- req_perform_parallel(reqs[is_stale])
      
      # Update cache with fresh responses.
      Map(function(key, resp) cache$set(key, resp), keys[is_stale], resps[is_stale])
      
      resps
    }
    
    cache <- cachem::cache_mem()
    
    reqs_perform_parallel_with_custom_cache(
      list(
        request("https://icite.od.nih.gov/api/pubs?pmids=17038628,17320734"),
        request("https://icite.od.nih.gov/api/pubs?pmids=17380955,17299329")
      ),
      cache
    )
    

    Switch the cache to cachem::cache_disk() if you want to persist across sessions.
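
    For the question's setup, a disk-backed version might look like this (the directory path is just an example):

    # Sketch: reuse the question's reqLst with a cache that persists on disk.
    cache <- cachem::cache_disk("cache/citePull")
    
    respLst <- reqs_perform_parallel_with_custom_cache(reqLst, cache)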