I'm trying to extract the HTML of a Google search results page in Python, using the requests module.
import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/search?q=how+to+get+google+search+page+source+code+by+python"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup)

search = soup.find_all('div', class_="yuRUbf")
print(search)
But I can't find any of this class_="yuRUbf"
in the code. I think it do not give me the source code. Now how can I do this work.
I also used resp.content
but it didn't work.
I also selenium
but it didn't work.
You can use the SelectorGadget Chrome extension to easily get CSS selectors by clicking on the desired element in your browser (it doesn't always work perfectly if the website is rendered via JavaScript).
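For example, once SelectorGadget shows you a class such as .tF2Cxc or .yuRUbf, you can pass it straight to BeautifulSoup's select() / select_one(). A minimal sketch on a static HTML snippet (the markup below is made up, just to show the mechanics):

from bs4 import BeautifulSoup

# a made-up fragment standing in for a real results page
html = """
<div class="tF2Cxc">
  <h3 class="DKV0Md">Example title</h3>
  <div class="yuRUbf"><a href="https://example.com">Example</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() accepts any CSS selector SelectorGadget produces
for result in soup.select(".tF2Cxc"):
    print(result.select_one(".DKV0Md").text)       # Example title
    print(result.select_one(".yuRUbf a")["href"])  # https://example.com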
To collect information from all pages you can use non-token pagination with a while True loop. The loop is endless; in our case it exits when there is no longer a button to switch to the next page, which we check with the CSS selector ".d6cvqb a[id=pnnext]":
if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
You can also exit the loop by setting a limit on the number of search pages:
if page_num == page_limit:
    break
Check the full code in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml  # lxml is used as the BeautifulSoup parser below

query = "how to get google search page source code by python"

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": query,    # search query
    "hl": "en",    # language of the search
    "gl": "uk",    # country of the search, uk -> United Kingdom
    "start": 0,    # results offset: 0 is the first page, 10 the second, and so on
    # "num": 100   # maximum number of results to return per page
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 5  # stop after this many pages
page_num = 0
data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:  # some results have no snippet
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    # exit either when the page limit is reached...
    if page_num == page_limit:
        break

    # ...or when there is no "next page" button left
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "How To Build a Website With Python - Digital.com",
"snippet": "Examples of Sites Created Using Python · Google: The most popular search engine in the world uses Python · Instagram: Python was used to create the backend of ...",
"links": "https://digital.com/how-to-create-a-website/with-python/"
},
{
"title": "Google Search Operators: 40 Commands to Know in 2023 ...",
"snippet": "30 Mar 2022 — ",
"links": "https://kinsta.com/blog/google-search-operators/"
},
{
"title": "Python From Scratch: Create a Dynamic Website - Code",
"snippet": "19 Feb 2022 — ",
"links": "https://code.tutsplus.com/articles/python-from-scratch-create-a-dynamic-website--net-22787"
},
{
"title": "How to Use Python to Analyze Google Search Results at Scale",
"snippet": "21 Dec 2020 — ",
"links": "https://www.semrush.com/blog/analyzing-search-engine-results-pages/"
},
other results ...
]
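If you want to keep the results instead of just printing them, you could dump the data list built by the script above to a file, for example (results.json is an arbitrary name):

import json

# assuming `data` is the list built by the scraping loop above
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)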
Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan. The difference is that it bypasses blocks (including CAPTCHA) from Google, so there's no need to create a parser and maintain it.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json

query = "how to get google search page source code by python"

params = {
    "api_key": "...",    # your serpapi key
    "engine": "google",  # serpapi parser engine
    "q": query,          # search query
    "num": "100"         # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary
    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    # follow SerpApi's own pagination link until there is no next page
    if "next_link" in results.get("serpapi_pagination", {}):
        # copy next_link's query parameters into the search object for the next request
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
{
"title": "How To Work with Web Data Using Requests and Beautiful ...",
"snippet": "This tutorial will go over how to work with the Requests and Beautiful Soup Python packages in order to make use of data from web pages.",
"link": "https://www.digitalocean.com/community/tutorials/how-to-work-with-web-data-using-requests-and-beautiful-soup-with-python-3"
},
{
"title": "google search - Simply Python",
"snippet": "I have included part of the code for the noun phrase detection (Under pattern_parsing.py). ... Run google search and obtain page source for the images.",
"link": "https://simply-python.com/tag/google-search/"
},
{
"title": "Web Scraping Using Selenium Python - Analytics Vidhya",
"snippet": "Step 2 – Install Chrome Driver · Step 2 – Install Chrome Driver · Step 3 – Specify search URL",
"link": "https://www.analyticsvidhya.com/blog/2020/08/web-scraping-selenium-with-python/"
},
other results ...
]
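The pagination line is the densest part of that script, so here is what it does in isolation: it takes the next_link URL returned by SerpApi, extracts its query string, and turns it into a dict used to update the search parameters for the next request (the URL below is a made-up example):

from urllib.parse import urlsplit, parse_qsl

# a made-up next_link, only to illustrate the mechanics
next_link = "https://serpapi.com/search.json?engine=google&q=python&start=100"

# urlsplit(...).query isolates the query string; parse_qsl splits it into pairs
print(dict(parse_qsl(urlsplit(next_link).query)))
# {'engine': 'google', 'q': 'python', 'start': '100'}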
There's a "13 ways to scrape any public data from any website" blog post if you want to know more about website scraping.