python web-scraping beautifulsoup findall

BeautifulSoup - Find <td> while filtering out <a> possibly?


I am having trouble scraping a site because the information I need is nested awkwardly and the table does not have a consistent size.

Here is an example of the HTML:

<tbody>
    <tr>
        <td>
            <a href="LINK">Player1</a>
        </td>
        <td>Position1</td>
        <td>
            <b>Player1 Injury</b>
            <br>
            "Date of injury1"
        </td>
        <td>
            <a href="LINK" class="BUTTON"></a>
        </td>
    </tr>
    <tr class="COLLAPSE"></tr>
    <tr>
        <td>
            <a href="LINK">Player2</a>
        </td>
        <td>Position2</td>
        <td>
            <b>Player2 Injury</b>
            <br>
            "Date of injury2"
        </td>
        <td>
            <a href="LINK" class="BUTTON"></a>
        </td>
    </tr>
    <tr class="COLLAPSE"></tr>
</tbody>

Given this data, all I am trying to do is pull the <td> elements that contain each player's injury and the date of that injury.

If I call

injury.find_all('td')

I will of course get all the extra data that I am not looking for. The data I want is always in the 3rd <td> of a row, so I need to find the 3rd <td> again in each new <tr>. Filtering out the class="COLLAPSE" rows should be easy, so hopefully that is not an issue.

After scraping this data, I would like to end up with:

['Player1 Injury', 'Date of injury1', 'Player2 Injury', 'Date of injury2']

All help is greatly appreciated.


Solution

  • Thanks for posting the HTML. Using that as an example, I think you need to iterate over each <tr> tag within the <tbody> tag and check whether it has the "COLLAPSE" class.

    If the <tr> tag doesn't have the "COLLAPSE" class, you can find all the <td> tags inside it and take the third one (index 2), which contains the player's injury and the date of the injury. The injury is the text of the <b> tag, and the date is the text node that follows the <br>.

    Code below:

    from bs4 import BeautifulSoup
    
    # HTML code
    html = """
    <tbody>
        <tr>
            <td>
                <a href="LINK">Player1</a>
            </td>
            <td>Position1</td>
            <td>
                <b>Player1 Injury</b>
                <br>
                "Date of injury1"
            </td>
            <td>
                <a href="LINK" class="BUTTON"></a>
            </td>
        </tr>
        <tr class="COLLAPSE"></tr>
        <tr>
            <td>
                <a href="LINK">Player2</a>
            </td>
            <td>Position2</td>
            <td>
                <b>Player2 Injury</b>
                <br>
                "Date of injury2"
            </td>
            <td>
                <a href="LINK" class="BUTTON"></a>
            </td>
        </tr>
        <tr class="COLLAPSE"></tr>
    </tbody>
    """
    
    # Parse the HTML
    soup = BeautifulSoup(html, 'html.parser')
    
    # Find all <tr> tags within the <tbody> tag
    trs = soup.tbody.find_all('tr')
    
    # Extract the player's injury and the date of their injury from each <tr> tag
    injuries = []
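    for tr in trs:
        # Skip the empty spacer rows marked with the COLLAPSE class
        if 'COLLAPSE' in tr.get('class', []):
            continue
        tds = tr.find_all('td')
        # Guard against rows that have fewer than three cells
        if len(tds) < 3:
            continue
        injury = tds[2].b.get_text(strip=True)
        # The date is the text node after the last <br>; the sample HTML
        # wraps it in literal double quotes, so strip those as well
        date = tds[2].find_all('br')[-1].next_sibling.strip().strip('"')
        injuries.append(injury)
        injuries.append(date)
    
    print(injuries)
    
    # Output: ['Player1 Injury', 'Date of injury1', 'Player2 Injury', 'Date of injury2']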
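
As an alternative to the loop above, the same "third <td> of every non-COLLAPSE row" rule can probably be written as a single CSS selector. This is only a sketch. It assumes a BeautifulSoup version whose select() is backed by soupsieve (bundled with bs4 4.7 and later), and it assumes the double quotes around the date are literal characters in the page, as they are in the sample. The sample markup is repeated in compact form so the block runs on its own.

```python
from bs4 import BeautifulSoup

# Compact copy of the sample markup from the question
html = """
<tbody>
    <tr>
        <td><a href="LINK">Player1</a></td>
        <td>Position1</td>
        <td>
            <b>Player1 Injury</b>
            <br>
            "Date of injury1"
        </td>
        <td><a href="LINK" class="BUTTON"></a></td>
    </tr>
    <tr class="COLLAPSE"></tr>
    <tr>
        <td><a href="LINK">Player2</a></td>
        <td>Position2</td>
        <td>
            <b>Player2 Injury</b>
            <br>
            "Date of injury2"
        </td>
        <td><a href="LINK" class="BUTTON"></a></td>
    </tr>
    <tr class="COLLAPSE"></tr>
</tbody>
"""

soup = BeautifulSoup(html, "html.parser")

injuries = []
# Select the third <td> of every row that does not carry the COLLAPSE class
for td in soup.select("tbody > tr:not(.COLLAPSE) > td:nth-of-type(3)"):
    # stripped_strings yields each non-empty text fragment inside the cell,
    # here the <b> text followed by the text node after the <br>
    for text in td.stripped_strings:
        # Drop the literal double quotes around the date (assumption: they
        # are really present in the page source, not added by dev tools)
        injuries.append(text.strip('"'))

print(injuries)
```

Because the selector only matches rows that actually have a third cell, rows with fewer cells are skipped without an explicit length check. If a cell ever contains more than two text fragments, this version collects all of them, so the explicit loop may be the safer choice for messier pages.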