Looping through Variable URLs to retrieve HREF tags with BeautifulSoup


I'm very new to Python and think I'm 95% there on this one, but truly can't figure out what could be wrong while troubleshooting:

I'm looking to loop through 50,000 URLs, where the only thing that changes in the URL is the final number.

Essentially, I'm making links like this:

"https://basketball.realgm.com/player/Carmelo-Anthony/Summary/1"
"https://basketball.realgm.com/player/Carmelo-Anthony/Summary/2"
"https://basketball.realgm.com/player/Carmelo-Anthony/Summary/3"

My next thought was to make a working loop, just to ensure I can do it correctly:

for tag in range(0, 4):
    resp = ("https://basketball.realgm.com/player/Carmelo-Anthony/Summary/" + str(tag))
    print(resp)

Based on the output, this seems to create the exact links I want.

I then wanted to merge it with the code that seemed to scrape all HREF tags from a given list of URLs (final code below):


import requests
from bs4 import BeautifulSoup

profiles = []

for tag in range(0, 50000):
    resp = ("https://basketball.realgm.com/player/Carmelo-Anthony/Summary/" + str(tag))

urls = [
    resp
]

for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.find('div', class_="profile-box").select('.half-column-left > p > a'):
        profile = profile.get('href')
        profiles.append(profile)

# print(profiles)

for p in profiles:
    if p.startswith('https'):
        print(tag, profile)

My confusion then stems from the fact it doesn't ALWAYS work. If I change the range to (0, 7), I do see results.

I did some exploring and saw that the URL below returns a 404 error:

https://basketball.realgm.com/player/Carmelo-Anthony/Summary/8

I figured it should just skip broken links -- I added in an "else" statement, but my results still weren't correct.

Is there something I'm doing wrong here?


Solution

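  • soup.find('div', class_="profile-box") returns None when a page has no such div, which is presumably the case for the 404 page. Your code then calls .select() on that None, which raises an AttributeError and stops the script.

    Check the result of find() before selecting from it: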

    import requests
    from bs4 import BeautifulSoup

    profiles = []

    for page in range(1, 50000):
        # Build the URL for this page number and fetch it
        req = requests.get(f"https://basketball.realgm.com/player/Carmelo-Anthony/Summary/{page}")
        soup = BeautifulSoup(req.text, 'html.parser')
        # find() returns None when the page has no profile box
        element = soup.find('div', class_="profile-box")
        if element is not None:
            for profile in element.select('.half-column-left > p > a'):
                profiles.append(profile.get('href'))

    print(profiles)
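
There also appears to be a second issue in the question's code, separate from the None check: the first loop reassigns resp on every pass, so by the time urls = [resp] runs, the list holds only the last URL generated. That would explain why range(0, 7) seemed to work (only Summary/6 was ever requested). A minimal sketch of collecting every URL instead is below; BASE_URL and build_urls are hypothetical names introduced for illustration, not part of the original post.

```python
# Hypothetical helper names; the URL pattern is taken from the question.
BASE_URL = "https://basketball.realgm.com/player/Carmelo-Anthony/Summary/"


def build_urls(start, stop):
    """Return one URL per page number in range(start, stop)."""
    return [f"{BASE_URL}{page}" for page in range(start, stop)]


# Every generated URL is kept in the list, not just the last one.
for url in build_urls(1, 4):
    print(url)
```

The resulting list can be passed to a loop like for url in urls: in place of the single-element list from the question.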