I'm trying to extract the HTML of a Google search results page in Python, using the requests module.
import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/search?q=how+to+get+google+search+page+source+code+by+python"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup)

search = soup.find_all('div', class_="yuRUbf")
print(search)
But I can't find any of this class_="yuRUbf"
in the code. I think it do not give me the source code. Now how can I do this work.
I also used resp.content
but it didn't work.
I also selenium
but it didn't work.
You can use the SelectorGadget Chrome extension to easily get CSS selectors by clicking on the desired element in your browser (it doesn't always work perfectly if the website is rendered via JavaScript).
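For example, once SelectorGadget shows you a class such as .tF2Cxc or .yuRUbf, you can pass it straight to BeautifulSoup's select() / select_one(). A minimal sketch on a static HTML snippet (the markup below is made up, just to show the mechanics):

from bs4 import BeautifulSoup

# a made-up fragment standing in for a real results page
html = """
<div class="tF2Cxc">
  <h3 class="DKV0Md">Example title</h3>
  <div class="yuRUbf"><a href="https://example.com">Example</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() accepts any CSS selector SelectorGadget produces
for result in soup.select(".tF2Cxc"):
    print(result.select_one(".DKV0Md").text)       # Example title
    print(result.select_one(".yuRUbf a")["href"])  # https://example.com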
To collect information from all pages you can use non-token pagination with a while True loop. The loop is endless; in our case it exits when there is no longer a button to switch to the next page, which we check with the CSS selector ".d6cvqb a[id=pnnext]":
if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
You can also exit the loop by setting a limit on the number of search pages:
if page_num == page_limit:
    break
Check the full code in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml  # lxml is used as the BeautifulSoup parser below

query = "how to get google search page source code by python"

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": query,    # search query
    "hl": "en",    # language of the search
    "gl": "uk",    # country of the search, uk -> United Kingdom
    "start": 0,    # results offset: 0 is the first page, 10 the second, and so on
    # "num": 100   # maximum number of results to return per page
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 5  # stop after this many pages
page_num = 0
data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:  # some results have no snippet
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    # exit either when the page limit is reached...
    if page_num == page_limit:
        break

    # ...or when there is no "next page" button left
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "How To Build a Website With Python - Digital.com",
"snippet": "Examples of Sites Created Using Python · Google: The most popular search engine in the world uses Python · Instagram: Python was used to create the backend of ...",
"links": "https://digital.com/how-to-create-a-website/with-python/"
},
{
"title": "Google Search Operators: 40 Commands to Know in 2023 ...",
"snippet": "30 Mar 2022 — ",
"links": "https://kinsta.com/blog/google-search-operators/"
},
{
"title": "Python From Scratch: Create a Dynamic Website - Code",
"snippet": "19 Feb 2022 — ",
"links": "https://code.tutsplus.com/articles/python-from-scratch-create-a-dynamic-website--net-22787"
},
{
"title": "How to Use Python to Analyze Google Search Results at Scale",
"snippet": "21 Dec 2020 — ",
"links": "https://www.semrush.com/blog/analyzing-search-engine-results-pages/"
},
other results ...
]
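If you want to keep the results instead of just printing them, you could dump the data list built by the script above to a file, for example (results.json is an arbitrary name):

import json

# assuming `data` is the list built by the scraping loop above
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)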
Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan. The difference is that it bypasses blocks (including CAPTCHA) from Google, so there's no need to create a parser and maintain it.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json

query = "how to get google search page source code by python"

params = {
    "api_key": "...",    # your serpapi key
    "engine": "google",  # serpapi parser engine
    "q": query,          # search query
    "num": "100"         # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary
    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    # follow SerpApi's own pagination link until there is no next page
    if "next_link" in results.get("serpapi_pagination", {}):
        # copy next_link's query parameters into the search object for the next request
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
{
"title": "How To Work with Web Data Using Requests and Beautiful ...",
"snippet": "This tutorial will go over how to work with the Requests and Beautiful Soup Python packages in order to make use of data from web pages.",
"link": "https://www.digitalocean.com/community/tutorials/how-to-work-with-web-data-using-requests-and-beautiful-soup-with-python-3"
},
{
"title": "google search - Simply Python",
"snippet": "I have included part of the code for the noun phrase detection (Under pattern_parsing.py). ... Run google search and obtain page source for the images.",
"link": "https://simply-python.com/tag/google-search/"
},
{
"title": "Web Scraping Using Selenium Python - Analytics Vidhya",
"snippet": "Step 2 – Install Chrome Driver · Step 2 – Install Chrome Driver · Step 3 – Specify search URL",
"link": "https://www.analyticsvidhya.com/blog/2020/08/web-scraping-selenium-with-python/"
},
other results ...
]
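The pagination line is the densest part of that script, so here is what it does in isolation: it takes the next_link URL returned by SerpApi, extracts its query string, and turns it into a dict used to update the search parameters for the next request (the URL below is a made-up example):

from urllib.parse import urlsplit, parse_qsl

# a made-up next_link, only to illustrate the mechanics
next_link = "https://serpapi.com/search.json?engine=google&q=python&start=100"

# urlsplit(...).query isolates the query string; parse_qsl splits it into pairs
print(dict(parse_qsl(urlsplit(next_link).query)))
# {'engine': 'google', 'q': 'python', 'start': '100'}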
There's a "13 ways to scrape any public data from any website" blog post if you want to know more about website scraping.