How to expand BeautifulSoup HREF search from <a> to <td>


I'm trying to find certain hrefs within some HTML. Until now I had been using the following, which worked:

for a in soup.find_all('a', href=True):
    if a['href'].startswith('/game/'):
        chunk = str(a).split('''"''')
        game = chunk[3]

for the following HTML:

<td colspan="4">
    <a href="/game/index/4599712?org id=418" class="skipMask" target="TEAM_WIN">35-28 </a>
</td>

My code successfully gave me /game/index/4599712?org id=418

However, other cells contain a separate href for each team, followed by that team's record. Example:

<td nowrap bgcolor="#FFFFFF">
    <a href="/team/145/18741">Philadelphia</a>
    " (3-1)                                     "
</td>

I would like some advice on this. I think I want to: 1) if the href starts with "/game/", get that href in a better way than splitting on quotation marks (probably with regular expressions?); 2) if the href starts with "/team/", build a dictionary that pairs Philadelphia with (3-1). Any suggestions or ideas would be appreciated.


Solution

  • To grab every href that starts with /game/, append the href value of each matching node to a list:

    >>> result1 = []
    >>> for a in soup.find_all('a', href=True):
    ...     if a['href'].startswith('/game/'):
    ...         result1.append(a['href'])
    ...
    >>> print(result1)
    ['/game/index/4599712?org id=418']
    

    For the second part, you can use a regex, but apply it to the plain text of the node that immediately follows the a tag (its next sibling):

    >>> import re
    >>> result2 = {}
    >>> for a in soup.find_all('a', href=True):
    ...     if a['href'].startswith('/team/'):
    ...         m = re.search(r"\((\d+-\d+)\)", a.next_sibling.string)
    ...         if m:
    ...             result2[a.string] = m.group(1)
    ...         else:
    ...             result2[a.string] = ""
    ...
    >>> print(result2)
    {'Philadelphia': '3-1'}
    

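    The pattern \((\d+-\d+)\) matches digits, a hyphen, and more digits enclosed in literal parentheses, and captures the part between the parentheses. If no such value is found, the team name is still added as a key, with an empty string as its value. Note that this assumes each team link is directly followed by a text node: if a.next_sibling is None, the attribute access raises AttributeError.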
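The two steps above can be combined into one self-contained script. This is only a sketch under some assumptions: the sample markup is adapted from the question, the helper name extract_links is hypothetical, and the prefix test is done by passing a compiled regex as the href filter to find_all instead of calling startswith in the loop. It also guards against a team link that has no following text node.

```python
import re
from bs4 import BeautifulSoup

# Sample markup adapted from the question; real pages will contain more cells.
HTML = """
<td colspan="4">
    <a href="/game/index/4599712?org id=418" class="skipMask" target="TEAM_WIN">35-28 </a>
</td>
<td nowrap bgcolor="#FFFFFF">
    <a href="/team/145/18741">Philadelphia</a> (3-1)
</td>
"""

# Captures e.g. "3-1" from "(3-1)".
RECORD_RE = re.compile(r"\((\d+-\d+)\)")


def extract_links(html):
    """Hypothetical helper: return (game hrefs, {team name: record})."""
    soup = BeautifulSoup(html, "html.parser")

    # A compiled regex as an attribute filter keeps only matching hrefs.
    games = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/game/"))]

    records = {}
    for a in soup.find_all("a", href=re.compile(r"^/team/")):
        sibling = a.next_sibling
        # Only search text nodes; a missing sibling or a tag yields "".
        text = sibling if isinstance(sibling, str) else ""
        m = RECORD_RE.search(text)
        records[a.get_text(strip=True)] = m.group(1) if m else ""
    return games, records


games, records = extract_links(HTML)
print(games)    # ['/game/index/4599712?org id=418']
print(records)  # {'Philadelphia': '3-1'}
```

Reading the href attribute directly also avoids splitting str(a) on quotation marks, which depends on the order in which the attributes happen to be serialized.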