
Correct way to get response body of XHR requests generated by a page with RStudio Chromote

I'd like to use Chromote to gather the response body of the XHR calls made by a website, but I find the API a bit complex to master, especially the async pipeline.

I guess I need to first enable the Network functionality and then load the page (this I can do), but then I need to:

  • list all XHR calls
  • filter them by recognizing patterns in the request URL
  • access the response body of the selected resources

Can someone provide any guidance or tutorial material in this regard?
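
The filtering and body-decoding steps in the list above can be tried without a browser. This is only a minimal sketch: the `paused` list is hand-made mock data, and the helper names `keep_resource` and `decode_body` are hypothetical, not part of chromote or crrri.

```r
library(jsonlite)

# Hypothetical helper: TRUE when both the URL and the resource type match
# their regular expressions (base R grepl, so stringr is not needed here).
keep_resource <- function(url, resource_type, url_filter = ".*", type_filter = ".*") {
  grepl(url_filter, url) && grepl(type_filter, resource_type)
}

# Hypothetical helper: return the body as text, decoding base64 when flagged.
decode_body <- function(resp) {
  if (isTRUE(resp$base64Encoded)) {
    rawToChar(jsonlite::base64_dec(resp$body))
  } else {
    resp$body
  }
}

# Mock data standing in for what a DevTools callback would deliver (assumption).
paused <- list(
  list(url = "https://example.com/api/items", type = "XHR",
       resp = list(body = jsonlite::base64_enc(charToRaw('{"id": 1}')),
                   base64Encoded = TRUE)),
  list(url = "https://example.com/style.css", type = "Stylesheet",
       resp = list(body = "body {}", base64Encoded = FALSE))
)

# Keep only XHR calls whose URL contains "/api/", then decode their bodies.
selected <- Filter(function(p) keep_resource(p$url, p$type, "/api/", "XHR"), paused)
bodies <- vapply(selected, function(p) decode_body(p$resp), character(1))
```

Both filters are regular expressions, so a match-everything default would be `".*"` rather than a bare `"*"`.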

UPDATE: OK, I switched to the crrri package and wrote a general-purpose function. The only missing part is the logic to decide when to close the connection and return the results:

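library(crrri)     # Chrome and the DevTools Protocol domains
library(promises)  # %...>%
library(stringr)   # str_detect
library(purrr)     # %>%, set_names
library(jsonlite)  # base64_dec

# url_filter and type_filter are regular expressions (see str_detect);
# '.*' matches everything, whereas a bare '*' is not a valid regex.
get_website_resources <- function(url, url_filter = '.*', type_filter = '.*') {

  chrome <- Chrome$new()
  out <- new.env()   # mutable store shared with the callbacks
  out$l <- list()
  client <- chrome$connect(callback = ~ NULL)
  Fetch <- client$Fetch
  Page <- client$Page
  # Pause every request at the response stage so its body can be read
  Fetch$enable(patterns = list(list(urlPattern="*", requestStage="Response"))) %...>% {
    Fetch$requestPaused(callback = function(params) {
      if (str_detect(params$request$url, url_filter) && str_detect(params$resourceType, type_filter)) {
        Fetch$getResponseBody(requestId = params$requestId) %...>% {
          resp <- .
          if (resp$body != '') {
            if (resp$base64Encoded) resp$body <- base64_dec(resp$body) %>% rawToChar()
            body <- list(list(
              url = params$request$url,
              response = resp
            )) %>% set_names(params$requestId)
            out$l <- append(out$l, body)
          }
        }
      }
      # Always resume the paused request, whether or not it matched
      Fetch$continueRequest(requestId = params$requestId)
    })
  } %...>% {
    Page$navigate(url)  # load the page once interception is set up
  }
  # Still missing: when to close the connection and return out$l
}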


  • Cracked it. Here's the final function. It uses crrri::perform_with_chrome, which forces synchronous behaviour, and runs the rest of the process inside a promise whose resolve callback is stored outside the promise itself. That callback is called either once a given number of resources has been collected or after a certain amount of time has passed:

    library(promises)  # %...>%, then
    library(stringr)   # str_detect
    library(purrr)     # %>%, set_names

    # url_filter and type_filter are regular expressions passed to str_detect
    get_website_resources <- function(url, url_filter = '.*', type_filter = '.*', wait_for = 20, n_of_resources = NULL, interactive = FALSE) {
        crrri::perform_with_chrome(function(client) {
            Fetch <- client$Fetch
            Page <- client$Page
            if (interactive) client$inspect()
            out <- new.env()
            out$results <- list()
            out$resolve_function <- NULL
            out$pr <- promises::promise(function(resolve, reject) {
                # Keep resolve reachable from the callbacks below
                out$resolve_function <- resolve
                Fetch$enable(patterns = list(list(urlPattern="*", requestStage="Response"))) %...>% {
                    Fetch$requestPaused(callback = function(params) {
                        if (str_detect(params$request$url, url_filter) && str_detect(params$resourceType, type_filter)) {
                            Fetch$getResponseBody(requestId = params$requestId) %...>% {
                                resp <- .
                                if (resp$body != '') {
                                    if (resp$base64Encoded) resp$body <- jsonlite::base64_dec(resp$body) %>% rawToChar()
                                    body <- list(list(
                                        url = params$request$url,
                                        response = resp
                                    )) %>% set_names(params$requestId)
                                    out$results <- append(out$results, body)
                                    # && short-circuits, so a NULL n_of_resources is never compared
                                    if (!is.null(n_of_resources) && length(out$results) >= n_of_resources) out$resolve_function(out$results)
                                }
                            }
                        }
                        # Always resume the paused request, whether or not it matched
                        Fetch$continueRequest(requestId = params$requestId)
                    })
                } %...>% {
                    Page$navigate(url)
                } %>% crrri::wait(wait_for) %>%
                    then(~ out$resolve_function(out$results))  # time-based fallback
            })
            out$pr$then(function(x) x)
        }, timeouts = max(wait_for + 3, 30), cleaning_timeout = max(wait_for + 3, 30))
    }
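
The "resolve from outside the promise" idea in the function above can be looked at on its own, without Chrome. This is a minimal sketch using only the promises and later packages; `collect_until` and its `add()` function are hypothetical names invented for illustration, and the time-based fallback uses `later::later` here instead of `crrri::wait`.

```r
library(promises)
library(later)

# Hypothetical helper: returns a promise plus an add() function. The promise
# resolves with the collected items once `n` items have been added, or after
# `timeout` seconds, whichever comes first.
collect_until <- function(n, timeout) {
  state <- new.env()
  state$items <- list()
  state$resolve <- NULL

  pr <- promise(function(resolve, reject) {
    state$resolve <- resolve  # expose resolve to code outside the promise
    later::later(function() resolve(state$items), delay = timeout)  # fallback
  })

  add <- function(item) {
    state$items <- append(state$items, list(item))
    if (length(state$items) >= n) state$resolve(state$items)
    invisible(NULL)
  }

  list(promise = pr, add = add)
}

collector <- collect_until(n = 2, timeout = 1)
result <- new.env()
then(collector$promise, function(items) result$items <- items)
collector$add("first")
collector$add("second")

# Promise callbacks run on the later event loop, so a plain script has to
# pump that loop until the callback has fired.
for (i in 1:50) {
  if (!is.null(result$items)) break
  later::run_now(0.1)
}
```

In the function above, `add()` corresponds to the `requestPaused` callback appending a response, and the timer corresponds to the `wait_for` delay.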