Search code examples
python-3.xgoogle-custom-searchgoogle-image-search

Retrieve Google results without using the Custom Search API


Recently I've been working on an idea that requires me to query Google Images and retrieve links for images matching that search term. My most promising candidate for a usable Google Images API was the Google Web Search API, but it looks like it's going to be going out of service as of tomorrow: https://developers.google.com/web-search/docs/

The API that replaced it is the Google Custom Search API, but it's a little discouraging to use:
Google API Custom Search with Python - Programmatic Search Results
100 search results a day is a very strict limit; that's just four searches per hour. I also don't want to have to go through the hassle of creating some custom search bar that I'm never going to use except through Python

I decided to turn to parsing HTML directly from the results page. This presents a problem, though, because nowhere inside the page's HTML is there any direct link to the image, only referrer URLs. This is true of the javascript-enabled and javascript-disabled versions of Google Images (so even if Python spoofs javascript as enabled, nothing). I'm not sure where to go from here. Could anyone refer me to some obscure, updated library that I've somehow overlooked, or give me some pointers?


Solution

  • You could use Selenium Webdriver to actually execute the JavaScript and click on the images in the thumbnail view. Once an image has been opened, the link is in the DOM and you can scrape it from there. All Webdriver does is open an actual browser and simulate a user. You can even run it as a headless browser if you use xvfbwrapper. The downside is that even then, you will need all the dependencies of the browser you are using installed on your server.

    However, scraping Google is against their terms of service and they will make an effort of blocking you as quickly as possible. So, unless you pass through the captchas (which are linked to sessions), you will possibly not be able to make a whole lot of searches before being blocked this way, either.