I'm trying to scrape articles on PubChem, such as this one, for instance. PubChem requires browsers to have Javascript enabled, or else it redirects to a page with virtually no content that says "This application requires Javascript. Please turn on Javascript in order to use this application". To go around this, I used the Chrome web driver from the Selenium library to obtain the HTML that PubChem generates with JavaScript.
And it does that about half the time. It also frequently does not render the full html, and redirects to the Javascript warning page. How do I make it so that the script retrieves the JS version of the site consistently?
I've also tried to overcome this issue by using PhantomJS, except PhantomJS somehow does not work on my machine after installation.
from bs4 import BeautifulSoup
from requests import get
from requests_html import HTMLSession
from selenium import webdriver
import html5lib
session = HTMLSession()
browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
url = "https://pubchem.ncbi.nlm.nih.gov/compound/"
browser.get(url)
innerHTML = browser.execute_script("return document.body.innerHTML")
soup = BeautifulSoup(innerHTML, "html5lib")
There are no error messages whatsoever. The only issues is that sometimes the web scraper cannot obtain the JS-rendered webpage as expected. Thank you so much!
Answering my own question because why not.
You need to quit your browser by
browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
# stuff
browser.quit()
and do so right after the last operation that involves the browser, as you risk having the browser cache affect your outputs in next iterations of running the script.
Hope that whoever has this issue finds this helpful!
UPDATE EDIT:
So closing the browser does increase the frequency of success, but doesn't make it consistent. Another thing that was helpful in making it work more frequently was running
sudo purge
in the terminal. I'm still not getting consistent results, however. If anyone has an idea of how to do it without using brute force (i.e. opening and closing the WebDriver until it renders the proper page), please let me know! Many thanks