python, web-scraping, html-parsing, python-requests, google-custom-search

How to scrape more than 100 google pages in one pass


I am using the requests library in Python to GET data from Google search results. https://www.google.com.pk/#q=pizza&num=10 returns the first 10 results, since I specified num=10. Likewise, https://www.google.com.pk/#q=pizza&num=100 returns the first 100 results.

But if I pass any number greater than 100, e.g. https://www.google.com.pk/#q=pizza&num=200, Google still returns only the first 100 results.

How can I get more than 100 in one pass?

Code:

import requests
url = 'http://www.google.com/search'
my_headers = { 'User-agent' : 'Mozilla/11.0' }
payload = { 'q' : 'pizza', 'start' : '0', 'num' : 200 }
r = requests.get( url, params = payload, headers = my_headers )

In "r" I get the URLs of only the first 100 Google results, not 200.


Solution

  • You can use a more programmatic API from Google to get the results rather than screen-scraping the human search interface. The code below has no error checking, and I make no assertion that it complies with all of Google's T&Cs, so I suggest you look into the details of using this URL:

    import requests
    
    def search(query, pages=4, rsz=8):
        url = 'https://ajax.googleapis.com/ajax/services/search/web'
        params = {
            'v': 1.0,     # Version
            'q': query,   # Query string
            'rsz': rsz,   # Result set size - max 8
        }
    
        for s in range(0, pages*rsz+1, rsz):
            params['start'] = s
            r = requests.get(url, params=params)
            for result in r.json()['responseData']['results']:
                yield result
    

    E.g. getting 200 results for 'google':

    >>> list(search('google', pages=24, rsz=8))
    [{'GsearchResultClass': 'GwebSearch',
      'cacheUrl': 'http://www.google.com/search?q=cache:y14FcUQOGl4J:www.google.com',
      'content': 'Search the world&#39;s information, including webpages, images, videos and more. \n<b>Google</b> has many special features to help you find exactly what you&#39;re looking\xa0...',
      'title': '<b>Google</b>',
      'titleNoFormatting': 'Google',
      'unescapedUrl': 'https://www.google.com/',
      'url': 'https://www.google.com/',
      'visibleUrl': 'www.google.com'},
      ...
    ]
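
A note on the arithmetic, since pages=24 returning 200 results is not obvious: the range in search() includes its upper page boundary, so it issues pages + 1 requests. A minimal sketch of just that calculation (start_offsets is a hypothetical helper name, not part of any API):

```python
def start_offsets(pages, rsz):
    # Same range expression as the loop in search() above;
    # the "+ 1" makes the final offset pages * rsz inclusive.
    return list(range(0, pages * rsz + 1, rsz))

offsets = start_offsets(pages=24, rsz=8)
print(len(offsets))              # 25 requests
print(offsets[0], offsets[-1])   # 0 192
print(len(offsets) * 8)          # 200 results at 8 per request
```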
    

    To use Google's Custom Search API you need to sign up as a developer. You get 100 free queries a day (I'm not sure whether every API call counts, or whether paginating the same query counts as one query):

    • Sign up @ https://console.developers.google.com
    • Create a project
    • Create a key
    • Enable Custom Search API
    • Create a Custom Search Engine @ https://cse.google.com
      • Use a dummy site to initialise the CSE
      • Edit the CSE to search the entire web
      • Delete the dummy site
    • Get the CSE reference (look at the public URL for cx=<cse reference>)

    Then you can use requests to make the query:

    import requests
    url = 'https://www.googleapis.com/customsearch/v1'
    params = {
        'key': '<key>',
        'cx': '<cse reference>',
        'q': '<search>',
        'num': 10,
        'start': 1
    }
    
    resp = requests.get(url, params=params)
    results = resp.json()['items']
    

    With start (which is 1-based here) you can paginate in much the same way as in the first example.
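
A rough sketch of what that pagination could look like, not a definitive implementation. The function name cse_search and its total, per_page and get parameters are my own assumptions, as is treating a missing 'items' key as "no more results". As far as I know the API caps num at 10 and will not page much past 100 results per query, so check the documentation before relying on larger totals.

```python
API_URL = 'https://www.googleapis.com/customsearch/v1'

def cse_search(query, key, cx, total=30, per_page=10, get=None):
    """Yield up to `total` result items, fetching one page at a time.

    `get` is a hypothetical injection point (defaults to requests.get)
    so the paging logic can be tried without network access.
    """
    if get is None:
        import requests  # imported lazily so a stub `get` works without it
        get = requests.get

    # Custom Search offsets are 1-based: 1, 11, 21, ...
    for start in range(1, total + 1, per_page):
        params = {
            'key': key,
            'cx': cx,
            'q': query,
            'num': min(per_page, total - start + 1),  # shrink the final page
            'start': start,
        }
        resp = get(API_URL, params=params)
        items = resp.json().get('items', [])
        if not items:  # assumption: no 'items' means exhausted or an error payload
            break
        for item in items:
            yield item
```

Usage would be along the lines of list(cse_search('<search>', '<key>', '<cse reference>', total=30)). Each page is a separate API call, so it is safest to assume each one counts against the daily quota.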

    Lots of other parameters are available; see the REST documentation for the CSE: https://developers.google.com/custom-search/json-api/v1/reference/cse/list#request

    Google also has a client API library (pip install google-api-python-client), which you can use instead:

    from googleapiclient import discovery
    service = discovery.build('customsearch', 'v1', developerKey='<key>')
    params = {
        'q': '<query>',
        'cx': '<cse reference>',
        'num': 10,
        'start': 1
    }
    query = service.cse().list(**params)
    results = query.execute()['items']