Tags: html, r, web-scraping, rvest

Scrape a Table-Like Index from HTML in R


I am trying to scrape the table at this website, which lists variable IDs, question text, variable type, and origin dataset from ICPSR's PATH Survey data. My end goal is a spreadsheet inventory of variable IDs and their corresponding question text, built by scraping this information in R, but I am having trouble getting it to work. In short, I want to get the table shown at that URL into a spreadsheet.

I've tried rvest, XML, and a number of other packages and strategies (read.table, htmltab, htmltable, etc.), but the underlying content does not appear to be a table-like object "under the hood", if you will. I am therefore struggling to find a resource or previous question that helps scrape something that is not a table in structure but certainly looks like one.

Any help would be appreciated on this. Thanks!


Solution

  • I think most of that content sits inside a script tag, and the browser pulls it from there dynamically with JavaScript while rendering the page.

    You can extract the relevant JavaScript object with a regular expression and parse it as JSON. However, the items in the returned list under response$docs vary in structure, so you will need to spend some time studying the JSON, deciding which fields you want and how to organise the output, and then writing a custom function to apply over the list, possibly returning a data frame of results.

    The following shows how to extract the documents list:

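    library(rvest)     # read_html(), html_text()
    library(stringr)   # str_match()
    library(magrittr)  # %>%
    library(jsonlite)  # parse_json()
    
    # Search URL from the question.
    url <- 'https://www.icpsr.umich.edu/web/NAHDAP/search/variables?start=0&sort=STUDYID asc,DATASETID asc,STARTPOS asc&SERIESFULL_FACET_Q=606|Population Assessment of Tobacco and Health (PATH) Study Series&DATASETTITLE_FACET=Wave 4: Youth / Parent Questionnaire Data&EXTERNAL_FLAG=1&ARCHIVE=NAHDAP&rows=1000#'
    
    # Read the page and collapse it to a single string of text.
    s <- read_html(url) %>% 
      html_text()
    
    # Capture the JavaScript object that sits between 'searchResults : ' and ', searchConfig'.
    r <- stringr::str_match(s, 'searchResults : (\\{.*\\}), searchConfig')
    
    # Column 2 of the match matrix holds the first capture group, i.e. the JSON text.
    data <- jsonlite::parse_json(r[1,2])
    docs <- data$response$docs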
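
Before deciding on columns, it may help to see which fields occur across the items and how often. The sketch below uses a small made-up stand-in for docs so that it runs without a network connection. The field names (VARNAME, QUESTION, CATEGORIES) are assumptions for illustration only, not the real ICPSR schema; STUDYID is borrowed from the sort parameter in the URL.

```r
# Toy stand-in for `docs`: a list of named lists with differing fields.
# Field names are illustrative assumptions, not the real ICPSR schema.
docs <- list(
  list(STUDYID = 1L, VARNAME = "V1", QUESTION = "Question text one"),
  list(STUDYID = 1L, VARNAME = "V2"),
  list(STUDYID = 2L, VARNAME = "V3", QUESTION = "Question text three",
       CATEGORIES = list("Yes", "No"))
)

# Count how many items carry each field name.
field_counts <- sort(table(unlist(lapply(docs, names))), decreasing = TRUE)
print(field_counts)

# Fields present in every item are the safest candidates for columns.
common_fields <- names(field_counts)[as.vector(field_counts) == length(docs)]
print(common_fields)

# Inspect the top-level structure of a single item.
str(docs[[1]], max.level = 1)
```

Running the same three steps on the real docs list should show which fields are reliable enough to become spreadsheet columns and which appear only in some items.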
    

    [Image: a sample item from the docs list. Bear in mind that items in the list vary in structure.]
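
The custom function mentioned above could look roughly like the following. This is only a sketch under stated assumptions: docs is replaced by a small made-up list, the field names VARNAME, QUESTION and TYPE are hypothetical, and get_field and docs_to_df are helper names I have invented here. Missing fields become NA, and multi-valued fields are collapsed into one string.

```r
# Toy stand-in for `docs`; field names are hypothetical.
docs <- list(
  list(VARNAME = "V1", QUESTION = "Question text one", TYPE = "numeric"),
  list(VARNAME = "V2", TYPE = "character"),
  list(VARNAME = "V3", QUESTION = list("Part a", "Part b"), TYPE = "numeric")
)

# Return one field of one item as a single string:
# NA if the field is absent, multiple values joined with "; ".
get_field <- function(item, field) {
  value <- item[[field]]
  if (is.null(value)) return(NA_character_)
  paste(unlist(value), collapse = "; ")
}

# Build a data frame with one row per item and one column per chosen field.
docs_to_df <- function(docs, fields) {
  cols <- lapply(fields, function(f) {
    vapply(docs, get_field, character(1), field = f)
  })
  names(cols) <- fields
  as.data.frame(cols, stringsAsFactors = FALSE)
}

inventory <- docs_to_df(docs, c("VARNAME", "QUESTION", "TYPE"))
print(inventory)

# Write the result to a CSV file that any spreadsheet program can open.
out_file <- file.path(tempdir(), "variable_inventory.csv")
write.csv(inventory, out_file, row.names = FALSE)
```

Passing the field list as an argument keeps the function usable once you have settled on the real field names. Joining multiple values with "; " is just one option, and a list column or separate rows may suit some fields better.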