python selenium proxy-server

Unable to Identify Webpage in BeautifulSoup by URL


I am using Python and Selenium to scrape all of the links from the results page of a search. No matter what I search for on the previous screen, the URL of the results page is always "https://chem.nlm.nih.gov/chemidplus/ProxyServlet". If I use Selenium to run the search automatically and then try to read this URL into BeautifulSoup, I get: HTTPError: HTTP Error 404: Not Found

Here is my code:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv


# create a new Firefox session
driver = webdriver.Firefox()
# implicit wait: allow up to 3 seconds when locating elements
driver.implicitly_wait(3)

# navigate to ChemIDPlus Website
driver.get("https://chem.nlm.nih.gov/chemidplus/")
#implicit wait: allow up to 10 seconds for the drop-down menu to load
driver.implicitly_wait(10)

#open drop-down menu QV7 ("Route:")
select=Select(driver.find_element(By.NAME, "QV7"))
#select "inhalation" in QV7
select.select_by_visible_text("inhalation")
#identify submit button

search="/html/body/div[2]/div/div[2]/div/div[2]/form/div[1]/div/span/button[1]"

#click submit button
driver.find_element(By.XPATH, search).click()

#increase the number of results per page
select=Select(driver.find_element(By.ID, "selRowsPerPage"))
select.select_by_visible_text("25")
#implicit wait: allow up to 3 seconds when locating elements on the results page
driver.implicitly_wait(3)

#identify current search page...HERE IS THE ERROR, I THINK
url1="https://chem.nlm.nih.gov/chemidplus/ProxyServlet"
page1=urlopen(url1)
#read the search page
soup=BeautifulSoup(page1.read(), 'html.parser')

I suspect this has something to do with the proxy server, and that Python is not receiving the information it needs to identify the page, but I'm not sure how to work around this. Thanks in advance!


Solution

  • As a workaround, I used Selenium to read the URL of the results page the browser actually landed on: url1=driver.current_url. Next, I used requests to fetch that page's content and fed it into BeautifulSoup. Altogether, I added:

    # added to the top of the script
    import requests
    ...
    # identify the current search page with Selenium
    url1=driver.current_url
    # fetch the content of the results page
    r=requests.get(url1)
    soup=BeautifulSoup(r.content, 'html.parser')
    ...
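
A possible alternative, since the results URL is the same for every search: Selenium already holds the HTML of the page the browser is showing in driver.page_source, so that HTML could be passed to BeautifulSoup directly, without a second HTTP request. The sketch below is an illustration under that assumption, not the code from the answer above. The helper names extract_links and links_from_driver are hypothetical, and the approach has not been checked against the ChemIDplus site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative hrefs against the URL of the page they came from.
        links.append(urljoin(base_url, anchor["href"]))
    return links


def links_from_driver(driver):
    """Hypothetical helper: parse whatever page the Selenium browser currently shows.

    Only the standard WebDriver attributes page_source and current_url are used,
    so no extra request is sent to the server.
    """
    return extract_links(driver.page_source, driver.current_url)
```

In the script from the question, this would be called as links=links_from_driver(driver) once the results page has loaded. Because it reads from the live browser session, it does not depend on the results URL being fetchable on its own.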