Tags: python, html, web-scraping, urllib2, google-search

Google image scraper written in Python returns different HTML (UI) than a web browser


I wrote a Google image scraper in Python using the urllib2 and BeautifulSoup libraries. It sends a search request to a URL that includes the query, then fetches the links to the first 10 images. What I need is the direct link to each image, for example:

http://images.mentalfloss.com/sites/default/files/styles/insert_main_wide_image/public/einstein1_7.jpg

When I search for the same query in my browser (Chrome) and view the HTML of the image search results page, the code includes the direct URL to the image (like the one above) and also the URL of the page that contains the image:

http://mentalfloss.com/article/49222/11-unserious-photos-albert-einstein

However, the HTML of the search results page that my Python scraper receives doesn't include the direct URL to the image, only the URL of the original page that contains it. When I save the resulting HTML and open the file in my browser, it shows an old Google image search UI. Clicking one of the thumbnails produces a 'Your file was not found. It may have been moved or deleted' error.

I am aware that a request sent from a browser and one sent from a Python library differ in their settings, but I am not sure which parameter causes this difference.

I attached screenshots of the two result UIs (the first is the result page from my Python scraper, the second is the result from the Chrome browser):

[Image: result page of the Python scraper]

[Image: result page of the web browser (Chrome)]

And here is part of my script:

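import urllib2  # Python 2 only; on Python 3 use urllib.parse.quote instead

import requests
from bs4 import BeautifulSoup


def search_image_google(name):
    google_url = "https://www.google.com/search?btnG=Search&site=webhp&tbm=isch&source=hp&q={}"
    # NOTE: headers is defined here but is not passed to requests.get() below.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
    url = google_url.format(urllib2.quote(name + ' face'))

    try:
        page = requests.get(url).text
        soup = BeautifulSoup(page, 'html.parser')

        # Save the fetched page so it can be inspected in a browser.
        result = soup.prettify("utf-8")
        with open('output.html', 'wb') as f:
            f.write(result)

        # Print the link behind each of the first 10 thumbnails.
        cnt = 0
        for link in soup.find_all('table', class_='images_table'):
            for child in link.contents:
                for row in child:
                    if cnt > 9:
                        break
                    else:
                        # Drop the first 7 characters of the href
                        # (presumably a '/url?q=' redirect prefix).
                        img_link = str(row.a['href'])[7:]
                        cnt += 1
                        print(img_link)

    except Exception as e:
        print('Exception: %s' % str(e))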

Please help!


Solution

  • Try examining all the HTTP headers your browser sends; you may need more than the User-Agent.

    Also remember to respect the site's /robots.txt!
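
As a rough illustration of both suggestions, here is a minimal sketch in Python 3 using only the standard library (urllib.request is the successor to urllib2). The header values in BROWSER_HEADERS and the helper names build_request, allowed_by_robots and fetch_search_page are assumptions made for this example, not part of the original script; copy the real header values from your browser's developer tools. With the requests library, the equivalent step is passing headers=... to requests.get().

```python
from urllib.parse import quote
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

SEARCH_URL = "https://www.google.com/search?btnG=Search&site=webhp&tbm=isch&source=hp&q={}"

# Assumed header set: replace these with the values your own browser sends
# (visible in the Network tab of its developer tools).
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
}


def build_request(name):
    """Build a request for the question's search URL with the headers attached."""
    url = SEARCH_URL.format(quote(name + " face"))
    return Request(url, headers=BROWSER_HEADERS)


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as plain text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fetch_search_page(name, robots_txt):
    """Fetch the results page only if the given robots.txt text permits it."""
    request = build_request(name)
    user_agent = BROWSER_HEADERS["User-Agent"]
    if not allowed_by_robots(robots_txt, user_agent, request.full_url):
        raise PermissionError("robots.txt disallows " + request.full_url)
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

Here robots_txt is the text of the site's /robots.txt, which you would download once before scraping. Whether these particular headers are enough to get the same markup as the browser is not guaranteed; compare the saved HTML after each change.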