python beautifulsoup python-requests get href

Problem getting the href link with .get('href') in a scraper?


I am trying to follow a video tutorial that is a bit outdated. In the video, href = links[idx].get('href') grabs the link, but when I use it here it doesn't work: it just prints None. If I call .getText() instead, it grabs the title.

The element containing both the href and the title is <a href="https://mullvad.net/nl/blog/2023/2/2/stop-the-proposal-on-mass-surveillance-of-the-eu/">Stop the proposal on mass surveillance of the EU</a>

Here's my code:

```python
import requests
from bs4 import BeautifulSoup

res = requests.get('https://news.ycombinator.com/news')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.select('.titleline')
votes = soup.select('.score')

def create_custom_hn(links, votes):
    hn = []
    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].get('href')
        print(href)
        #hn.append({'title': title, 'link': href})
    return hn

print(create_custom_hn(links, votes))
```

I tried to grab the link using .get('href'), but it only returns None.


Solution

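  • The .titleline selector matches the <span> that wraps each link, not the <a> itself, so calling .get('href') on it returns None. Select your elements more specifically, and avoid building separate lists: there is no need for them, and you would have to ensure that they all have the same length.

    You could get all the information in one go by selecting each <tr> with class athing and its next sibling.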

    Example

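    import requests
    from bs4 import BeautifulSoup

    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    res = requests.get('https://news.ycombinator.com/news')
    soup = BeautifulSoup(res.text, 'html.parser')

    data = []
    # Each story is a <tr class="athing">; its score sits in the following <tr>
    for row in soup.select('.athing'):
        link = row.select_one('span a')
        data.append({
            'title': link.text,
            'link': link.get('href'),
            # find_next_sibling skips any whitespace text nodes between rows
            'score': list(row.find_next_sibling('tr').find('span').stripped_strings)[0]
        })
    print(data)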
    

    Output

    [{'title': 'Stop the proposal on mass surveillance of the EU',
      'link': 'https://mullvad.net/nl/blog/2023/2/2/stop-the-proposal-on-mass-surveillance-of-the-eu/',
      'score': '287 points'},
     {'title': 'Bay 12 Games has made $7M from the Steam release of Dwarf Fortress',
      'link': 'http://www.bay12forums.com/smf/index.php?topic=181354.0',
      'score': '416 points'},
     {'title': "Google's OSS-Fuzz expands fuzz-reward program to $30000",
      'link': 'https://security.googleblog.com/2023/02/taking-next-step-oss-fuzz-in-2023.html',
      'score': '31 points'},
     {'title': "Connecticut Parents Arrested for Letting Kids Walk to Dunkin' Donuts",
      'link': 'https://reason.com/2023/01/30/dunkin-donuts-parents-arrested-kids-cops-freedom/',
      'score': '225 points'},
     {'title': 'Ronin 2.0 – open-source Ruby toolkit for security research and development',
      'link': 'https://ronin-rb.dev/blog/2023/02/01/ronin-2-0-0-finally-released.html',
      'score': '62 points'},...]
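
To see why .get('href') printed None in the question, here is a minimal sketch that runs offline. The markup is a hypothetical, trimmed-down imitation of one Hacker News row (the example.com URL, title, and score are made up), and it assumes the page wraps each story link in a <span class="titleline">, as the question's selector suggests.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on a single story row; not real page data.
SAMPLE_HTML = """
<table>
<tr class="athing"><td class="title"><span class="titleline"><a href="https://example.com/story">Example story</a><span class="sitebit comhead"> (<a href="from?site=example.com">example.com</a>)</span></span></td></tr>
<tr><td class="subtext"><span class="score">42 points</span></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# '.titleline' matches the wrapping <span>, which has no href attribute,
# so .get('href') falls back to its default of None.
span = soup.select_one(".titleline")
print(span.name, span.get("href"))   # span None

# Selecting the direct child <a> returns the tag that actually carries the href.
link = soup.select_one(".titleline > a")
print(link.name, link.get("href"))   # a https://example.com/story
print(link.get_text())               # Example story
```

If you would rather keep the structure of the original code, changing the selector to soup.select('.titleline > a') should be enough for links[idx].get('href') to return the URL, since each list item is then the <a> tag rather than the <span> around it.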