
How to retrieve multiple tables from a webpage using R


I want to extract all of the vaccine tables using R, including the description to the left of each table and the descriptions inside it.

This is the link to the webpage: https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro

This is how the first table looks on the webpage:

(screenshot of the first table not included)

I tried using the XML package, but I wasn't successful. I used:

vup<-readHTMLTable("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro", which=5)

I get an error:


Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: '' 

How can I do this?


Solution

  • This webpage does not use HTML table elements, which is the reason for your error. Because of the multiple subsections and hidden text, the page's formatting is quite complicated, and the nodes of interest have to be found individually.

    I prefer the "rvest" and "xml2" packages for their easier and more straightforward syntax.
    This is not a complete solution, but it should get you moving in the correct direction.

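    library(rvest)
    library(xml2)   #xml_parent() and xml_length() come from xml2
    library(dplyr)
    
    #read the page (URL taken from the question)
    page <- read_html("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro")
    
    #find the top of the vaccine section
    parentvaccine <- page %>% html_node(xpath="//div[@id='vaccines_intro']") %>% xml_parent()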
    
    #find the vaccine rows
    vaccines <- parentvaccine %>% html_nodes(xpath = ".//div[@class='chart_row for_vaccines']")
    
    #find info on each one
    company <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_developer w-richtext']") %>% html_text()
    product <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_vaccines w-richtext']") %>% html_text()
    phase <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_stage']") %>% html_text()
    misc <- vaccines %>% html_node(xpath = ".//div[@class='chart_row-expanded for_vaccines']") %>% html_text()
    
    
    #determine the vaccine type of each row
    #get the vaccine type headings
    vaccinetypes <- parentvaccine %>% html_nodes(xpath = './/div[@class="chart-section for_vaccines"]') %>% 
       html_node('div.is_h3') %>% html_text()
    #determine the number of vaccines in each category
    #(assumes one role="list" container per category, in the same order as the headings)
    lengthvector <- parentvaccine %>% html_nodes(xpath = './/div[@role="list"]') %>% xml_length()
    #repeat each type once per vaccine in its category
    VaccineType <- rep(vaccinetypes, times=lengthvector)
    
    answer <- data.frame(VaccineType,  company, product, phase)
    head(answer)
    

    Writing this code involved reading the page's HTML source and identifying the correct nodes and the unique attributes that mark the desired information.
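
As a self-contained illustration of the same idea, here is a minimal sketch that runs on a made-up HTML fragment instead of the live page. The markup, class names, and values in it are hypothetical and only mimic the general layout (sections with a heading and a list of rows, and no table elements). It shows why a table reader finds nothing, how to pull fields out by class, and how to repeat each section heading once per row in that section.

```r
library(rvest)
library(xml2)

# Hypothetical miniature of the layout: two sections, built from <div>s only
html <- '
<div>
  <div class="chart-section">
    <div class="is_h3">Type A</div>
    <div role="list">
      <div class="chart_row"><div class="is_developer">Dev 1</div><div class="is_stage">Phase 1</div></div>
      <div class="chart_row"><div class="is_developer">Dev 2</div><div class="is_stage">Phase 2</div></div>
    </div>
  </div>
  <div class="chart-section">
    <div class="is_h3">Type B</div>
    <div role="list">
      <div class="chart_row"><div class="is_developer">Dev 3</div><div class="is_stage">Phase 1</div></div>
    </div>
  </div>
</div>'
toy <- read_html(html)

# There are no <table> nodes, so a table reader has nothing to parse
n_tables <- length(html_nodes(toy, "table"))

# One node per row, then one field per row
rows <- html_nodes(toy, "div.chart_row")
developer <- rows %>% html_node("div.is_developer") %>% html_text()
stage <- rows %>% html_node("div.is_stage") %>% html_text()

# Section headings and the number of rows under each heading
sections <- html_nodes(toy, "div.chart-section")
type <- sections %>% html_node("div.is_h3") %>% html_text()
rows_per_section <- sections %>%
  html_node(xpath = './/div[@role="list"]') %>%
  xml_length()

# Repeat each heading once per row in its section so the columns line up
toy_df <- data.frame(type = rep(type, times = rows_per_section), developer, stage)
print(toy_df)
```

Using `rep()` with `times` set to a vector of per-section counts keeps each heading aligned with its own rows. This only works if the headings and the list containers come back in the same order and in equal numbers, which is worth checking on the real page.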