python urllib2 google-search httplib google-local-search

Google serves its homepage to urllib2 when a local search is made


When a local search is done on Google and the user clicks the 'More ...' link below the map, the user is brought to a page such as this.

If the URL:

https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl

is copied out and pasted back into a browser, one arrives, as expected, at the same page. Likewise when a browser is opened with WebDriver, directly accessing the URL brings WebDriver to the same page.

When the same page is requested with urllib2, however, Google serves its home page (google.com) instead, which means, among other things, that lxml's extraction capabilities cannot be used on the results.

While urllib2 is probably not the culprit here (perhaps Google does the same with all headless requests), is there any way of getting Google to serve the desired page? A quick test with the requests library shows the same issue.


Solution

  • I think the big hint here is in the URL:

    https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl

    Notice the hash character (#) in there? Everything after the hash (the URL fragment) is never sent to the server, so the server can't process it. In this case, that indicates that the page you are seeing in WebDriver and in your browser is the result of client-side scripting.

    When you load the page, your browser sends a request for https://www.google.com/ncr and Google returns the home page. The home page contains JavaScript that reads the part after the hash and uses it to generate the page you expect to see. The browser and WebDriver can do this because they execute the JavaScript. If you disable JavaScript in your browser and go to that link, you'll see that the page isn't generated there either.

    urllib2, however, does not process JavaScript. All it sees is the HTML that the website initially sent along with the JavaScript; it can't run the JavaScript that actually generates the page you are expecting.

    Google is serving the page your request actually asks for; the problem is that urllib2 is not equipped to render it. To fix this, you'll have to use a scraping tool that executes JavaScript. Alternatively, in this particular case, you could simply use the non-JavaScript version of Google for your scraping.
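
To make the fragment point concrete, here is a minimal sketch that splits the URL from the question and shows what an HTTP client such as urllib2 would actually request. The helper names (url_sent_to_server, fragment_as_query) are hypothetical, and moving the fragment onto a /search path is only an assumption about what a non-JavaScript request might look like; whether Google returns the local results for that URL is not confirmed here.

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2, alongside urllib2

# URL taken verbatim from the question.
URL = "https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl"


def url_sent_to_server(url):
    """Return the URL with its fragment dropped, i.e. what the client requests."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


def fragment_as_query(url, path="/search"):
    """Move the fragment into the query string of an assumed server-side path.

    The "/search" default is an assumption for illustration, not a
    documented endpoint for this kind of local search.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.fragment, ""))


sent = url_sent_to_server(URL)
candidate = fragment_as_query(URL)

print(sent)       # the search terms are gone: only the /ncr page is requested
print(candidate)  # the same parameters, now in a part of the URL the server sees
```

The first printed URL contains none of the search parameters, which is why the home page comes back. The second is only a candidate to try with urllib2 or requests; if it does not return the expected results, a JavaScript-capable tool such as WebDriver remains the fallback.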