I'm trying to find certain hrefs within HTML. I had been using the following, which had been working:
for a in soup.find_all('a', href=True):
    if a['href'].startswith('/game/'):
        chunk = str(a).split('''"''')
        game = chunk[3]
for the following HTML:
<td colspan="4">
<a href="/game/index/4599712?org id=418" class="skipMask" target="TEAM_WIN">35-28 </a>
</td>
My code successfully gave me /game/index/4599712?org id=418.
However, other tags hold separate hrefs for the teams, followed by each team's record. Example:
<td nowrap bgcolor="#FFFFFF">
<a href="/team/145/18741">Philadelphia</a>
 (3-1)
</td>
I would like some advice with this. I think I want two things: 1) If the href starts with "/game/", I'd like a better way of getting that href than splitting on quotation marks (probably regular expressions?). 2) If the href starts with "/team/", I'd like to be able to create a dictionary that pairs Philadelphia with (3-1). Any suggestions or ideas would be appreciated.
To grab all href values that start with /game/, just append the found node's href value to a list:
>>> result1 = []
>>> for a in soup.find_all('a', href=True):
...     if a['href'].startswith('/game/'):
...         result1.append(a['href'])
...
>>> print(result1)
['/game/index/4599712?org id=418']
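As a possible alternative, the prefix check can be pushed into the search itself. The sketch below assumes BeautifulSoup 4 with the built-in html.parser and reuses the markup from the question; the names html, game_links and game_links_css are only illustrative. It shows two ways of doing the same filtering: passing a compiled regex as the href filter, and using a CSS attribute selector.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = '''
<td colspan="4">
<a href="/game/index/4599712?org id=418" class="skipMask" target="TEAM_WIN">35-28 </a>
</td>
'''

soup = BeautifulSoup(html, "html.parser")

# A compiled regex as the href filter: only anchors whose href
# matches the pattern are returned (the ^ anchors it to the start).
game_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/game/"))]

# The same filter as a CSS selector: href begins with "/game/".
game_links_css = [a["href"] for a in soup.select('a[href^="/game/"]')]

print(game_links)
print(game_links_css)
```

Either form avoids both the string splitting and the explicit startswith test; which one reads better is mostly a matter of taste.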
As for the second one, you may use a regex, but apply it to the plain text of the node that follows the a tag (its next sibling):
>>> import re
>>> result2 = {}
>>> for a in soup.find_all('a', href=True):
...     if a['href'].startswith('/team/'):
...         m = re.search(r"\((\d+-\d+)\)", a.next_sibling.string)
...         if m:
...             result2[a.string] = m.group(1)
...         else:
...             result2[a.string] = ""
...
>>> print(result2)
{'Philadelphia': '3-1'}
The \((\d+-\d+)\) pattern extracts digits, a hyphen, and more digits when they appear inside parentheses. If no such value is present, the key is still added to the dictionary, but with an empty string as its value.
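To see how the pattern behaves on its own, here is a minimal sketch that uses only the standard re module. The helper name extract_record and the constant RECORD are hypothetical, introduced just for this illustration. It also treats None as "no record", since a sibling's text is not guaranteed to be a string.

```python
import re

# Digits, a hyphen, digits, all wrapped in literal parentheses.
# The inner group captures just the "3-1" part.
RECORD = re.compile(r"\((\d+-\d+)\)")

def extract_record(text):
    """Return the record found in text, or '' if there is none."""
    # Fall back to an empty string so that None does not raise a TypeError.
    m = RECORD.search(text or "")
    return m.group(1) if m else ""

print(extract_record(" (3-1) "))    # text with a record
print(extract_record("no record"))  # text without one
print(extract_record(None))         # no text at all
```

Inside the loop this could be called as extract_record(getattr(a.next_sibling, "string", None)), which also covers the case where the a tag has no next sibling at all.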