r web-scraping rvest pubchem

Rvest web scrape returns empty character


I am looking to scrape some data from a chemical database using R, mainly name, CAS Number, and molecular weight for now. However, I am having trouble getting rvest to extract the information I'm looking for. This is the code I have so far:

library(rvest)
library(magrittr)

# Read HTML code from website
# I am using this format because I ultimately hope to pull specific items from several different websites
webpage <- read_html(paste0("https://pubchem.ncbi.nlm.nih.gov/compound/", 1))

# Use CSS selectors to scrape the chemical name
chem_name_html <- webpage %>%
                  html_nodes(".short .breakword") %>%
                  html_text()

# html_text() above already returned a character vector, so no second conversion is needed
chem_name_data <- chem_name_html

However, when I try to create chem_name_html, R only returns an empty character vector. I am using SelectorGadget to get the HTML node, but I noticed that SelectorGadget gives me a different node than the Inspector in Google Chrome does. I have tried both ".short .breakword" and ".summary-title short .breakword" in that line of code, but neither gives me what I am looking for.

Screenshot of SelectorGadget and Inspector


Solution

  • I recently ran into the same issue using rvest to scrape PubChem. The problem is that the information on the page is rendered with JavaScript as you scroll down, so the HTML that rvest downloads contains only minimal information and none of the nodes you see in the browser.
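
    As a rough illustration of why the selector comes back empty, here is a minimal sketch. The HTML string is a made-up stand-in for a JavaScript-rendered page, not PubChem's actual markup: the static source holds only an empty container, and the node the selector targets would exist only after a browser ran the script.

    ```r
    library(rvest)

    # Hypothetical stand-in for a JavaScript-rendered page (not real PubChem HTML):
    # the static source has an empty container that a script would fill in a browser.
    static_html <- '<html><body>
      <div id="app"></div>
      <script>/* browser-side code would add the .breakword node here */</script>
    </body></html>'

    page <- read_html(static_html)

    # rvest never runs the script, so the selector matches nothing
    nodes <- html_nodes(page, ".breakword")
    n_matches <- length(nodes)
    node_text <- html_text(nodes)
    ```

    Checking length(nodes) before calling html_text() is a quick way to tell "wrong selector or content not in the static HTML" apart from "node found but empty".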

    There are a few workarounds though. The simplest way to get the information that you need into R is using an R package called webchem.

    If you are looking up name, CAS number, and molecular weight then you can do something like:

    library(webchem)
    chem_properties <- pc_prop(1, properties = c('IUPACName', 'MolecularWeight'))

    The full list of compound properties that can be extracted using this API can be found here. Unfortunately, the API has no property for the CAS number, but webchem gives us another way to look that up, using the Chemical Translation Service.

    chem_cas <- cts_convert(query = '1', from = 'CID', to = 'CAS')

    The second way to get information from the page that is a bit more robust but not quite as easy to work with is by grabbing information from the JSON api.

    library(jsonlite)
    chem_json <- read_json(paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/", "1", "/JSON/?response_type=save&response_basename=CID_", "1"))
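
    Since the goal is to pull data for several compounds, the URL construction can be wrapped in a small function. This is only a sketch: pug_view_url is a hypothetical helper name, and it assumes the two query parameters are meant to be joined with "&".

    ```r
    # Hypothetical helper: build the PUG View JSON URL for one or more CIDs.
    # Assumes the query parameters are joined with "&".
    pug_view_url <- function(cid) {
      paste0(
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/",
        cid,
        "/JSON/?response_type=save&response_basename=CID_",
        cid
      )
    }

    # paste0() is vectorised, so a vector of CIDs yields a vector of URLs
    urls <- pug_view_url(c(1, 2, 3))
    ```

    Each element of urls could then be passed to read_json(), for example with lapply(), ideally with a short pause between requests.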

    That command returns a deeply nested list of lists, and I had to write a convoluted script to pull out the information I needed. If you are familiar with JSON, you can get far more information about the compound this way, but not quite everything. For example, sections like Literature, Patents, and Biomolecular Interactions and Pathways do not fully show up in the JSON.
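
    One way to tame the nested list is a small recursive search by section heading. The sketch below runs on a mock record, not real PubChem output: the layout (Record, Section, TOCHeading, Information, Value, StringWithMarkup) is my assumption about what the JSON looks like, the CAS value is a placeholder, and find_section is a hypothetical helper.

    ```r
    library(jsonlite)

    # Mock, heavily trimmed record with an ASSUMED layout and a placeholder value
    mock_json <- '{
      "Record": {
        "RecordNumber": 1,
        "Section": [
          {"TOCHeading": "Names and Identifiers",
           "Section": [
             {"TOCHeading": "Other Identifiers",
              "Section": [
                {"TOCHeading": "CAS",
                 "Information": [
                   {"Value": {"StringWithMarkup": [{"String": "0000-00-0"}]}}
                 ]}
              ]}
           ]}
        ]
      }
    }'

    # simplifyVector = FALSE keeps the list-of-lists shape that read_json() returns by default
    chem_json <- fromJSON(mock_json, simplifyVector = FALSE)

    # Hypothetical helper: depth-first search for the first section
    # whose TOCHeading equals `heading`; returns NULL if none is found
    find_section <- function(node, heading) {
      if (identical(node$TOCHeading, heading)) return(node)
      for (child in node$Section) {
        hit <- find_section(child, heading)
        if (!is.null(hit)) return(hit)
      }
      NULL
    }

    cas_section <- find_section(chem_json$Record, "CAS")
    cas_value <- cas_section$Information[[1]]$Value$StringWithMarkup[[1]]$String
    ```

    Searching by heading avoids hard-coding positions such as [[3]][[1]][[2]], which shift whenever a compound has more or fewer sections.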

    The final and most comprehensive way to get all the information from the page is to use something like Scrapy or PhantomJS to render the full HTML output of the PubChem page, then use rvest to scrape it as you originally intended. This is something I'm still working on, as it is my first time using web scrapers as well.

    I'm still a beginner in this realm, but hopefully this helps you a bit.