BeautifulSoup Returning Wrong href Value


I'm using the following code to scrape a SERP for some SEO work, but when I read the href attribute I get incorrect results: weird URLs from elsewhere on the page instead of the one I intended. What is wrong with my code?

import html

import requests
from bs4 import BeautifulSoup

URL = "https://www.google.com/search?q=beautiful+soup&rlz=1C1GCEB_enIN922IN922&oq=beautiful+soup&aqs=chrome..69i57j69i60l3.2455j0j7&sourceid=chrome&ie=UTF-8"
r = requests.get(URL)
webPage = html.unescape(r.text)  # decode HTML entities before parsing

soup = BeautifulSoup(webPage, 'html.parser')
gresults = soup.findAll('h3')  # every <h3> on the page, not only result titles

for result in gresults:
    print(result.text)
    # collect every link found under the heading's grandparent element
    links = result.parent.parent.find_all('a', href=True)
    for link in links:
        print(link.get('href'))

The output looks like this:

/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/&sa=U&ved=2ahUKEwjv6-q3tJ30AhX_r1YBHU9OAeMQFnoECAAQAg&usg=AOvVaw2Q

Solution

  • What happens?

    • Selecting only <h3> gives you a result set that also contains unwanted elements.

    • Moving up to the parent's parent is fine, but calling find_all() there is not necessary (and do not use the older findAll() syntax in new code); it also returns <a> elements you may not want.
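
The difference between find_all(), find() and going straight to the enclosing <a> can be seen on a small static snippet. This is only a sketch: the markup in SAMPLE_HTML is made up for illustration and is not the real structure of the search page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the real page): one result link
# wrapping the <h3>, plus an unrelated link in the same container.
SAMPLE_HTML = """
<div class="result">
  <a href="/url?q=https://example.com/"><h3>Example title</h3></a>
  <a href="/search?q=related:example.com">Similar</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
h3 = soup.find("h3")
container = h3.parent.parent  # h3 -> enclosing <a> -> surrounding <div>

# find_all() returns every link under the container, wanted or not
all_links = [a["href"] for a in container.find_all("a", href=True)]

# find() stops at the first matching link in document order
first_link = container.find("a", href=True)["href"]

# find_parent() walks up from the heading to the <a> that wraps it
own_link = h3.find_parent("a")["href"]

print(all_links)
print(first_link)
print(own_link)
```

Here all_links contains both hrefs, while first_link and own_link contain only the link that wraps the heading.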

    How to fix?

    Select your target element more specifically; then you can use:

    result.parent.parent.find('a',href=True).get('href')
    

    But I would recommend going with the following example.

    Example

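    import requests
    from bs4 import BeautifulSoup

    # send a browser-like User-Agent with the request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
    url = 'http://www.google.com/search?q=beautiful+soup'

    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')  # needs the lxml package installed

    data = []

    # only <h3> elements nested inside a link within the #search container
    for h3 in soup.select('#search a h3'):
        data.append({
            'title': h3.text,
            # the selector guarantees an enclosing <a>, so look it up directly
            'url': h3.find_parent('a')['href'],
        })
    data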
    

    Output

    [{'title': 'Beautiful Soup 4.9.0 documentation - Crummy',
      'url': 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'},
     {'title': 'Beautiful Soup Tutorial: Web Scraping mit Python',
      'url': 'https://lerneprogrammieren.de/beautiful-soup-tutorial/'},
     {'title': 'Beautiful Soup 4 - Web Scraping mit Python | HelloCoding',
      'url': 'https://hellocoding.de/blog/coding-language/python/beautiful-soup-4'},
     {'title': 'Beautiful Soup - Wikipedia',
      'url': 'https://de.wikipedia.org/wiki/Beautiful_Soup'},
     {'title': 'Beautiful Soup (HTML parser) - Wikipedia',
      'url': 'https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)'},
     {'title': 'Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...',
      'url': 'https://beautiful-soup-4.readthedocs.io/'},
     {'title': 'BeautifulSoup4 - PyPI',
      'url': 'https://pypi.org/project/beautifulsoup4/'},
     {'title': 'Web Scraping und Parsen von HTML in Python mit Beautiful ...',
      'url': 'https://www.twilio.com/blog/web-scraping-und-parsen-von-html-python-mit-beautiful-soup'}]
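
The output in the question shows hrefs wrapped as /url?q=<target>&sa=... rather than the plain target. If a response still contains links in that form, the target can be pulled out with the standard library. This is a minimal sketch that assumes the real address always sits in the q parameter; unwrap_redirect is a hypothetical helper name, not part of BeautifulSoup or requests.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_redirect(href: str) -> str:
    """Return the target of a '/url?q=<target>&...' href, or href unchanged."""
    parts = urlsplit(href)
    if parts.path == "/url":
        # parse_qs maps each query parameter to a list of its values
        target = parse_qs(parts.query).get("q")
        if target:
            return target[0]
    return href


# the wrapped href taken from the question's output
wrapped = ("/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/"
           "&sa=U&ved=2ahUKEwjv6-q3tJ30AhX_r1YBHU9OAeMQFnoECAAQAg&usg=AOvVaw2Q")
print(unwrap_redirect(wrapped))
```

Hrefs that are not in the /url?q= form are returned as they are, so the helper can be applied to every link without checking first.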