python selenium selenium-webdriver web-scraping scraper

Selenium webdriver with python to scrape dynamic page cannot find element


Many questions about scraping dynamic content have been asked on Stack Overflow, and I went through all of them, but none of the suggested solutions worked for the following problem:

Context:

Issue:

I have not been able to access any of the DOM elements on this page. Hints on how to access the search bar and the search button would be a great start. (See the page to scrape.) What I want in the end is to go through a list of addresses, launch the search for each one, and copy the information displayed on the right-hand side of the screen.

I have tried the following:

  • Changed the browser for webdriver (from Chrome to Firefox)
  • Added waiting time for the page to load

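    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    try:  # wait up to 10 seconds for the search bar to appear in the DOM
        WebDriverWait(self.driver, 10).until(EC.presence_of_element_located((By.ID, "addressInput")))
    except TimeoutException:
        print("address input not found")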
    
  • Tried to access the item by ID, XPATH, NAME, TAG NAME, etc., nothing worked.

Questions

  • What else could I try that I have not so far (using Selenium webdriver)?
  • Are some websites really impossible to scrape? (I don't think the city used an algorithm to generate a random DOM every time I reload the page.)

Solution

  • You can load this URL, http://50.17.237.182/PIM/, directly to get the source:

    In [73]: from selenium import webdriver
    
    
    In [74]: dr = webdriver.PhantomJS()
    
    In [75]: dr.get("http://50.17.237.182/PIM/")
    
    In [76]: print(dr.find_element_by_id("addressInput"))
    <selenium.webdriver.remote.webelement.WebElement object at 0x7f4d21c80950>
    

    If you look at the source returned for the original page, there is a frame element whose src attribute is that URL:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
       "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    
    <head>
      <title>San Francisco Property Information Map </title>
      <META name="description" content="Public access to useful property information and resources at the click of a mouse"><META name="keywords" content="san francisco, property, information, map, public, zoning, preservation, projects, permits, complaints, appeals">
    </head>
    <frameset rows="100%,*" border="0">
      <frame src="http://50.17.237.182/PIM" frameborder="0" />
      <frame frameborder="0" noresize />
    </frameset>
    
    <!-- pageok -->
    <!-- 02 -->
    <!-- -->
    </html>
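
To check from a script which URL a frame loads, the frame tags can be read out of the outer page's HTML. The following is a minimal sketch using only Python's standard library; `FrameSrcParser`, `frame_sources`, and `outer_page` are illustrative names that do not appear in the original answer, and the sample HTML is trimmed from the source shown above. With Selenium, the HTML string would typically come from `dr.page_source`.

```python
from html.parser import HTMLParser


class FrameSrcParser(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attrs is a list of (name, value) pairs.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:  # skip frames that have no src attribute
                self.sources.append(src)


def frame_sources(html):
    """Return the src URLs of all frames found in an HTML string."""
    parser = FrameSrcParser()
    parser.feed(html)
    return parser.sources


# Trimmed copy of the frameset from the outer page.
outer_page = """
<frameset rows="100%,*" border="0">
  <frame src="http://50.17.237.182/PIM" frameborder="0" />
  <frame frameborder="0" noresize />
</frameset>
"""

print(frame_sources(outer_page))
```

Any URL this returns can either be loaded directly or reached by switching into the corresponding frame.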
    

    Thanks to @Alecxe, the simplest method is to use dr.switch_to.frame(0):

    In [77]: dr = webdriver.PhantomJS()
    
    In [78]: dr.get("http://propertymap.sfplanning.org/")
    
    In [79]: dr.switch_to.frame(0)
    
    In [80]: print(dr.find_element_by_id("addressInput"))
    <selenium.webdriver.remote.webelement.WebElement object at 0x7f4d21c80190>
    

    If you visit http://50.17.237.182/PIM/ in your browser, you will see exactly the same page as propertymap.sfplanning.org/; the only difference is that the former gives you direct access to the elements.

    If you want to input a value and click the search button, it is something like:

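    from selenium import webdriver

    # PhantomJS support was removed in Selenium 4; newer versions need
    # a different driver, such as webdriver.Firefox().
    dr = webdriver.PhantomJS()
    dr.get("http://propertymap.sfplanning.org/")

    # The search form lives inside the first frame, so switch into it first.
    dr.switch_to.frame(0)

    # Type an address into the search bar, then click the search button.
    # (Selenium 4.3+ spells these find_element(By.ID, ...) and find_element(By.XPATH, ...).)
    dr.find_element_by_id("addressInput").send_keys("whatever")
    dr.find_element_by_xpath("//input[@title='Search button']").click()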
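
The question's end goal, running the search for a whole list of addresses, could be sketched as a loop around the same three steps. This is a sketch under stated assumptions rather than a tested scraper: `search_addresses`, `read_results`, and `PAGE_URL` are hypothetical names, `read_results` stands in for whatever code extracts the right-hand panel (its selectors are not given in this thread), and the locator strings `"id"` and `"xpath"` are the values behind Selenium's `By.ID` and `By.XPATH`, which `find_element` accepts in Selenium 4.

```python
PAGE_URL = "http://propertymap.sfplanning.org/"


def search_addresses(driver, addresses, read_results, page_url=PAGE_URL):
    """Search each address in turn and collect whatever read_results returns.

    driver       -- an already-created Selenium WebDriver
    addresses    -- iterable of address strings to search for
    read_results -- hypothetical callback: takes the driver, returns scraped data
    """
    results = {}
    for address in addresses:
        # Reload the page so every search starts from the top-level document.
        driver.get(page_url)
        # The search form is inside the first frame.
        driver.switch_to.frame(0)

        box = driver.find_element("id", "addressInput")
        box.clear()
        box.send_keys(address)
        driver.find_element("xpath", "//input[@title='Search button']").click()

        # Extraction is delegated to the caller, since the panel markup is unknown here.
        results[address] = read_results(driver)
    return results
```

It might be called as `search_addresses(webdriver.Firefox(), my_addresses, my_reader)`. Because the page is dynamic, `read_results` would probably need an explicit `WebDriverWait` before reading the panel.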
    

    But if you want to pull data, you may find querying by URL an easier option; the query returns JSON.
