I'm trying to find all the <a> HTML tags of a specific type in an HTML document.
My code:
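import requests
from bs4 import BeautifulSoup

for url in top_url_list:  # assuming top_url_list is a list of URL strings
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")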
At this point I need to pull out (with some regex) part of the link in an href attribute.
The tag looks like this:
"<a href="/players/a/abdelal01.html">Alaa Abdelnaby</a>"
There are other <a href...> tags that don't follow this convention, and I don't want find_all() to return them.
What can I pass to find_all() to retrieve only the <a> tags I need to work on?
The other links on the page don't follow that convention because they aren't links to player pages; they might be links to team pages and so on.
I would check whether href starts with /players, using a CSS attribute selector:
for link in soup.select('a[href^="/players"]'):
    print(link["href"])
Or check whether it contains players:
for link in soup.select('a[href*=players]'):
    print(link["href"])
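Since the question asks about find_all() specifically, the same "starts with" filter can also be sketched with a compiled regex passed as the href argument. The sample markup below is made up for illustration (the team link in particular is a hypothetical URL), and html.parser is just one possible parser choice:

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup: one player link and one hypothetical team link.
html = """
<a href="/players/a/abdelal01.html">Alaa Abdelnaby</a>
<a href="/teams/POR/1991.html">Portland Trail Blazers</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() accepts a compiled regex as an attribute filter. The pattern is
# applied with search(), so the ^ anchor is what makes it mean "starts with".
player_links = soup.find_all("a", href=re.compile(r"^/players/"))
player_hrefs = [link["href"] for link in player_links]
print(player_hrefs)
```

Tags without an href attribute are skipped by this filter, so there is no need to guard against missing keys when reading link["href"].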
Since you are interested only in the HTML filename, split by / and take the last item:
print(link["href"].split("/")[-1])
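If you would rather use a regex, as the question mentions, here is a minimal sketch that pulls out just the ID part without the .html extension. The pattern assumes every player href looks like /players/<letter>/<id>.html, which is a guess based on the single example in the question; player_id is a hypothetical helper name:

```python
import re

# Assumed shape: /players/<letter>/<id>.html (inferred from one example href).
PLAYER_HREF = re.compile(r"^/players/[a-z]/([^/]+)\.html$")

def player_id(href):
    """Return the ID part of a player href, or None if it does not match."""
    match = PLAYER_HREF.match(href)
    return match.group(1) if match else None

print(player_id("/players/a/abdelal01.html"))
print(player_id("/teams/POR/1991.html"))  # hypothetical non-player link
```

Returning None for non-matching links lets the same function double as a filter if some unwanted hrefs slip through the selector.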