I need to scrape Google search result links.
However, I keep getting HTTP error 429 even though I put time.sleep() calls
in my code.
It works for 50–100 rows and then returns error 429, but there are hundreds of barcode links I still need to scrape.
How can I solve this problem?
import time
from itertools import chain

import pandas as pd
import requests
from bs4 import BeautifulSoup

barcode_df = pd.read_csv("C:/Users/emina/Coding_Projects/PycharmProjects/drug_interaction(pycharm)/barcodes.csv")
barcode_list2d = barcode_df.values.tolist()
barcode_list = list(chain.from_iterable(barcode_list2d))  # This is the list we'll iterate over
barcode_list = [x for x in barcode_list if type(x) == str]
barcode_list_deneme = barcode_list[0:20]
barcode_list1 = barcode_list[0:1000]

USER_AGENT = "some user agent"
headers = {"user-agent": USER_AGENT}


def append_links_to_csv(barcode):
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")
        for g in soup.find_all('div', class_='r'):
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']  # Parses search link
                l.write(barcode + "," + link + "\n")
                time.sleep(0.06)
    else:
        print(resp.status_code)


count = 0
l = open("links.csv", "a")
for barcode in barcode_list1:
    query = barcode + "+" + "site:ilacabak.com"
    url = f"https://google.com/search?q={query}"
    resp = requests.get(url, headers=headers)
    append_links_to_csv(barcode)
    count += 1
    print(count)
    time.sleep(1.5)
    if count % 100 == 0:
        l.close()
        l = open("links.csv", "a")
l.close()
One thing you could try is rotating User-Agents, in combination with honoring the Retry-After
response header, which indicates how long to wait before making a new request when the response status code is 429, as Sameer Naik already suggested.
For example:
import random
import requests

user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]

for _ in range(len(user_agent_list)):
    # Pick a random user agent for each request
    user_agent = random.choice(user_agent_list)
    headers = {'User-Agent': user_agent}
    requests.get('URL', headers=headers)
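The snippet above only rotates User-Agents. To also honor Retry-After, you could wrap the request in a small retry helper. This is just a minimal sketch: fetch() and max_retries are names I made up, it reuses the user_agent_list from above, and it assumes Retry-After (when present) is given in seconds; Google often omits the header, in which case the sketch falls back to exponential backoff.

import random
import time
import requests

def fetch(url, max_retries=5):
    # Retry the request with a freshly rotated User-Agent each attempt.
    # On HTTP 429, sleep for the server-suggested Retry-After interval,
    # falling back to exponential backoff when the header is missing.
    resp = None
    for attempt in range(max_retries):
        headers = {'User-Agent': random.choice(user_agent_list)}
        resp = requests.get(url, headers=headers)
        if resp.status_code != 429:
            return resp
        wait_seconds = int(resp.headers.get('Retry-After', 2 ** attempt))
        time.sleep(wait_seconds)
    return resp  # still 429 after max_retries attempts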
Alternatively, you can avoid thinking about this altogether by using the Google Search Engine Results API from SerpApi. It's a paid API with a free plan. Check out the Playground.
The difference in your case is that you only need to decide what data to extract from the structured JSON, rather than figuring out how to bypass blocks from Google (or other search engines),
or how to extract certain elements from the HTML (especially if the data you need is rendered by JavaScript and you don't want to use browser automation such as selenium
or requests-html).
Example code to integrate:
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "Coffee",
    "location": "Austin, Texas, United States",
    "google_domain": "google.com",
    "gl": "us",
    "hl": "en"
    # other query parameters
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(result['title'], result['link'], sep='\n')
    # prints all results from the first page of organic results
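Applied to your barcodes, the same example could be adapted roughly like this. It's only a sketch: it assumes your existing barcode_list1 and the "barcode,link" format of your links.csv, and it keeps your site:ilacabak.com filter in the query.

from serpapi import GoogleSearch

with open("links.csv", "a") as out:
    for barcode in barcode_list1:
        params = {
            "api_key": "YOUR_API_KEY",
            "engine": "google",
            "q": f"{barcode} site:ilacabak.com",
        }
        results = GoogleSearch(params).get_dict()
        # Write each organic result as "barcode,link", matching your CSV format
        for result in results.get("organic_results", []):
            out.write(f"{barcode},{result['link']}\n")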
Disclaimer: I work for SerpApi.