I'm using the following code to scrape a SERP for some SEO work, but when I read the href
attribute I get incorrect results: other weird URLs from the page instead of the one I intended. What is wrong with my code?
import html

import requests
from bs4 import BeautifulSoup

URL = "https://www.google.com/search?q=beautiful+soup&rlz=1C1GCEB_enIN922IN922&oq=beautiful+soup&aqs=chrome..69i57j69i60l3.2455j0j7&sourceid=chrome&ie=UTF-8"
r = requests.get(URL)
webPage = html.unescape(r.text)
soup = BeautifulSoup(webPage, 'html.parser')
text = ''
gresults = soup.findAll('h3')
for result in gresults:
    print(result.text)
    links = result.parent.parent.find_all('a', href=True)
    for link in links:
        print(link.get('href'))
The output looks like this:
/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/&sa=U&ved=2ahUKEwjv6-q3tJ30AhX_r1YBHU9OAeMQFnoECAAQAg&usg=AOvVaw2Q
Selecting only <h3> gives you a result set that also contains unwanted elements.
Moving up to the parent's parent is okay, but calling find_all() there is not necessary, because it will also return <a> elements you may not want. (Also, do not use the older findAll() syntax in new code.)
Select your target element more specifically, and then you can use:
result.parent.parent.find('a', href=True).get('href')
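To make the difference between find_all() and find() concrete, here is a minimal sketch on a static snippet. The markup is hypothetical and only mimics the shape of a result block (a heading link plus an extra link in the same container); Google's real HTML differs and changes often.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for illustration only.
SAMPLE_HTML = """
<div id="search">
  <div class="result"><a href="https://example.com/first"><h3>First title</h3></a><a href="https://example.com/cached">Cached</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
h3 = soup.find("h3")

# h3.parent is the <a>, h3.parent.parent is the surrounding <div>.
container = h3.parent.parent

# find_all() returns every link in the container, including the extra one.
all_links = [a["href"] for a in container.find_all("a", href=True)]

# find() stops at the first matching link.
first_link = container.find("a", href=True).get("href")

print(all_links)
print(first_link)
```

With this snippet, find_all() also picks up the "Cached" link, while find() returns only the first href in the container.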
But I would recommend going with the following example:
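from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent; without it Google may return different markup.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
url = 'http://www.google.com/search?q=beautiful+soup'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')  # the 'lxml' parser requires the lxml package

data = []
# Select only <h3> elements nested inside a link within the #search container.
for h3 in soup.select('#search a h3'):
    data.append({
        'title': h3.text,
        'url': h3.parent['href'],  # the parent of the <h3> is expected to be the <a>
    })
data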
[{'title': 'Beautiful Soup 4.9.0 documentation - Crummy',
  'url': 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'},
 {'title': 'Beautiful Soup Tutorial: Web Scraping mit Python',
  'url': 'https://lerneprogrammieren.de/beautiful-soup-tutorial/'},
 {'title': 'Beautiful Soup 4 - Web Scraping mit Python | HelloCoding',
  'url': 'https://hellocoding.de/blog/coding-language/python/beautiful-soup-4'},
 {'title': 'Beautiful Soup - Wikipedia',
  'url': 'https://de.wikipedia.org/wiki/Beautiful_Soup'},
 {'title': 'Beautiful Soup (HTML parser) - Wikipedia',
  'url': 'https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)'},
 {'title': 'Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...',
  'url': 'https://beautiful-soup-4.readthedocs.io/'},
 {'title': 'BeautifulSoup4 - PyPI',
  'url': 'https://pypi.org/project/beautifulsoup4/'},
 {'title': 'Web Scraping und Parsen von HTML in Python mit Beautiful ...',
  'url': 'https://www.twilio.com/blog/web-scraping-und-parsen-von-html-python-mit-beautiful-soup'}]
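The output in the question shows hrefs of the form /url?q=<target>&sa=..., while the result above contains the target URLs directly. If you still receive the wrapped form, one possible way to recover the target is to read the q query parameter with the standard library. This is only a sketch: unwrap_google_href is a hypothetical helper name, the sample href is a shortened version of the one in the question, and it assumes the wrapper keeps this /url?q= shape.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_google_href(href):
    """Return the target of a '/url?q=...' style href, or href unchanged.

    Hypothetical helper; assumes the redirect wrapper has the path '/url'
    and carries the target in the 'q' query parameter.
    """
    parts = urlsplit(href)
    if parts.path == "/url":
        target = parse_qs(parts.query).get("q")
        if target:
            return target[0]
    return href


# Shortened sample based on the href shown in the question.
wrapped = "/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/&sa=U"
direct = "https://pypi.org/project/beautifulsoup4/"

print(unwrap_google_href(wrapped))
print(unwrap_google_href(direct))
```

Links that do not match the /url?q= pattern are returned unchanged, so the helper can be applied to every href you collect.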