Tags: r, web-scraping, rvest

Problems extracting data using JSON in R (getting a lexical error)


Related to the question asked here: R - Using SelectorGadget to grab a dataset

library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)

# Return the position of the first state whose name matches `state`
get_state_index <- function(states, state) {
  return(match(TRUE, map(states, ~ {
    .x$name == state
  })))
}

s <- read_html("https://www.opentable.com/state-of-industry") %>% html_text()
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
fullbook <- all_data$covidDataCenter$fullbook

hawaii_dataset <- tibble(
  date = fullbook$headers %>% unlist() %>%  as.Date(),
  yoy = fullbook$states[get_state_index(fullbook$states, "Hawaii")][[1]]$yoy %>% unlist()
)

I am trying to grab the Hawaii dataset from the State tab. The code used to work, but this line now throws an error:

all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])

I am getting the error:

Error: lexical error: invalid char in json text.
                                       NA
                    (right here) ------^

Any suggestions? The website seems to have looked the same all year, so what kind of change is causing the code to break?

EDIT: The solution proposed by @QHarr:

all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])

This worked for a while, but it seems the website's underlying HTML has changed again.


Solution

  • Change the regex pattern as shown below so that it captures the desired string within the response text, i.e. the JavaScript object used for all_data:

    all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])
    

    Note: in an R string literal each regex backslash is doubled, e.g. \\s in the code above corresponds to \s in the regex itself.
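
The NA in the error message suggests the regex stopped matching: str_match() returns NA when there is no match, and that NA is what gets passed to parse_json(). Below is a minimal sketch of the difference between the two patterns. It runs on a toy string rather than the live page, so the contents of s are an assumption, and parse_initial_state() is a hypothetical helper, not part of stringr or jsonlite.

```r
library(stringr)
library(jsonlite)

# Toy stand-in for the page text (an assumption; the real page is far larger).
# Note the newline inside the object and the absence of a trailing ";w.".
s <- 'w.__INITIAL_STATE__ = {"covidDataCenter":\n{"fullbook":{"headers":["2020-02-18"]}}};'

# Original pattern: needs ";w." after the object, and "." cannot cross a newline.
old <- str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2]
is.na(old)  # no match, so the value handed to parse_json() would be NA

# Updated pattern: [\s\S] matches any character, newlines included.
new <- str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2]

# Hypothetical helper: fail with a clear message instead of a lexical error.
parse_initial_state <- function(json_text) {
  if (is.na(json_text)) {
    stop("__INITIAL_STATE__ not found; the page markup may have changed.")
  }
  parse_json(json_text)
}

all_data <- parse_initial_state(new)
all_data$covidDataCenter$fullbook$headers[[1]]
```

Because [\\s\\S]+ is greedy, the capture runs to the last "});" in the text. The sketch therefore assumes that the final "};" on the page is the one closing the __INITIAL_STATE__ object. The explicit NA check means that if the site changes again, the failure points at the regex rather than at the JSON parser.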