Tags: python, beautifulsoup, screen-scraping

Using `find_all()` to get only a subset of tags of the same type


I'm trying to find all the <a> tags of a specific type in an HTML document.

My code:

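import requests
from bs4 import BeautifulSoup

for url in top_url_list:  # iterate over the URLs themselves, not indices
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly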

At this point I need to pull out (with some regex) part of the link in an <a> tag's href attribute.

The tag looks like this:

<a href="/players/a/abdelal01.html">Alaa Abdelnaby</a>

There are other <a href...> tags that don't follow this convention, and I don't want find_all() to return them.

What can I pass to find_all() to retrieve only the set of <a> tags I need to work on?


Solution

  • The other links on the page don't follow that convention because they aren't links to player pages; they might be links to team pages, for example.

    I would check whether the href starts with /players:

    for link in soup.select('a[href^="/players"]'):
        print(link["href"]) 
    

    Or check whether it contains players anywhere in the href:

    for link in soup.select('a[href*="players"]'):
        print(link["href"])
    

    Since you are only interested in the HTML filename, split the href on / inside the loop and take the last item:

    print(link["href"].split("/")[-1])
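
Putting the pieces together, here is a minimal, self-contained sketch. The sample markup is hypothetical apart from the first link, which comes from the question. Since the question asks what to pass to find_all(), the sketch also shows what I believe is the equivalent find_all() call, using a compiled regular expression as the href filter; that variant is my addition, not part of the answer above.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page: only the first link is taken from the question.
html = """
<a href="/players/a/abdelal01.html">Alaa Abdelnaby</a>
<a href="/teams/xyz/index.html">Some team page</a>
<a href="/players/b/example01.html">Example Player</a>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: keep <a> tags whose href starts with /players.
via_select = [
    link["href"].split("/")[-1]
    for link in soup.select('a[href^="/players"]')
]

# find_all() variant: a compiled regex is searched against each href value,
# so the ^ anchor restricts matches to hrefs beginning with /players/.
via_find_all = [
    link["href"].split("/")[-1]
    for link in soup.find_all("a", href=re.compile(r"^/players/"))
]

print(via_select)    # ['abdelal01.html', 'example01.html']
print(via_find_all)  # ['abdelal01.html', 'example01.html']
```

Both approaches skip the team link. The selector form is shorter; the regex form is probably the better fit if you later need a capture group to extract part of the path.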