Tags: python, selenium, selenium-webdriver, beautifulsoup, python-requests-html

Selenium Chrome web driver inconsistently executes JS scripts on webpages


I'm trying to scrape articles on PubChem, such as this one, for instance. PubChem requires browsers to have JavaScript enabled; otherwise it redirects to a page with virtually no content that says "This application requires Javascript. Please turn on Javascript in order to use this application". To get around this, I used the Chrome WebDriver from the Selenium library to obtain the HTML that PubChem generates with JavaScript.

And it does that about half the time. The rest of the time, it fails to render the full HTML and lands on the JavaScript warning page instead. How do I make the script retrieve the JS-rendered version of the site consistently?

I've also tried to work around this issue with PhantomJS, but PhantomJS somehow does not work on my machine even after installation.

from bs4 import BeautifulSoup
from requests import get
from requests_html import HTMLSession
from selenium import webdriver
import html5lib

session = HTMLSession()
browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
url = "https://pubchem.ncbi.nlm.nih.gov/compound/"
browser.get(url)
innerHTML = browser.execute_script("return document.body.innerHTML")
soup = BeautifulSoup(innerHTML, "html5lib")

There are no error messages whatsoever. The only issue is that sometimes the web scraper cannot obtain the JS-rendered webpage as expected. Thank you so much!
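For what it's worth, one way to tell the two outcomes apart programmatically is to check the returned HTML for the warning text before parsing it; a minimal sketch (the string-match heuristic is my own assumption, not an official PubChem API):

```python
# Heuristic success check: the redirect page contains the Javascript
# notice quoted above, while a properly rendered article does not.
JS_WARNING = "This application requires Javascript"

def looks_rendered(html):
    # Returns True when the HTML appears to be the JS-rendered article
    # rather than the "turn on Javascript" warning page.
    return JS_WARNING not in html
```

After `browser.execute_script("return document.body.innerHTML")`, calling `looks_rendered(innerHTML)` tells you whether the scrape succeeded before handing the HTML to BeautifulSoup.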


Solution

  • Answering my own question because why not.

You need to quit your browser with

    browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
    # stuff
    browser.quit()
    

and do so right after the last operation that involves the browser; otherwise you risk the browser cache affecting your output on subsequent runs of the script.
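To make sure the quit actually happens even when an intermediate step throws, the pattern can be wrapped in a try/finally; a minimal sketch (the `make_browser`/`task` names are mine, not from the answer):

```python
def run_and_quit(make_browser, task):
    # Create the browser, run the task, and always quit -- even if the
    # task raises -- so no stale Chrome process or cached state survives
    # into the next run of the script.
    browser = make_browser()
    try:
        return task(browser)
    finally:
        browser.quit()

# Intended use with Selenium (assumes a local chromedriver binary):
#   from selenium import webdriver
#   html = run_and_quit(
#       lambda: webdriver.Chrome('/Users/user/Documents/chromedriver'),
#       lambda b: (b.get(url),
#                  b.execute_script("return document.body.innerHTML"))[1],
#   )
```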

    Hope that whoever has this issue finds this helpful!

UPDATE:

So closing the browser does increase the frequency of success, but doesn't make it consistent. Another thing that helped it work more often was running

    sudo purge
    

in the terminal. I'm still not getting consistent results, however. If anyone has an idea of how to do this without resorting to brute force (i.e. opening and closing the WebDriver until it renders the proper page), please let me know! Many thanks
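For anyone who does want the brute-force route in the meantime, here is a sketch of the open-quit-retry loop described above. The attempt count, the warning-text check, and the `make_browser` factory parameter are my own assumptions; I have not verified this against PubChem:

```python
def fetch_js_page(url, make_browser, attempts=5):
    # Open a fresh browser for every attempt and quit it each time, so a
    # stale cache from one try cannot poison the next one.
    for _ in range(attempts):
        browser = make_browser()
        try:
            browser.get(url)
            html = browser.execute_script("return document.body.innerHTML")
        finally:
            browser.quit()
        # Retry if we got the "turn on Javascript" warning page.
        if "This application requires Javascript" not in html:
            return html
    raise RuntimeError(f"page never rendered after {attempts} attempts")

# With Selenium (assumes a local chromedriver binary):
#   from selenium import webdriver
#   html = fetch_js_page(
#       url,
#       lambda: webdriver.Chrome('/Users/user/Documents/chromedriver'),
#   )
```

Injecting the browser factory keeps the loop testable without a real Chrome installation, and the `finally` guarantees the quit-after-last-operation advice from the answer is followed on every attempt.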