Tags: python-3.x, beautifulsoup, html-parsing

BeautifulSoup Not Finding All 'th'


I'm currently trying to scrape a statistics site in Python 3.7 using BeautifulSoup. I'm trying to grab all of the headers from a table as my column headers, but for some reason BeautifulSoup isn't grabbing all of the headers that are located within the 'th' tags.

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.eliteprospects.com/team/552/guelph-storm/2005-2006?tab=stats'
html = urlopen(url)
scraper = BeautifulSoup(html, 'html.parser')
# Take the th cells of the first tr on the page as the column headers.
column_headers = [th.getText() for th in scraper.findAll('tr', limit=1)[0].findAll('th')]
print(column_headers)

Here is the output I am getting: ['#', 'Player', 'GP', 'G', 'A', 'TP']

Here is the output I should be getting: ['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', 'GP', 'G', 'A', 'TP', 'PIM', '+/-']

For reference, here is the table's source HTML:

<table class="table table-striped table-sortable skater-stats highlight-stats" data-sort-url="https://www.eliteprospects.com/team/552/guelph-storm/2005-2006?tab=stats" data-sort-ajax-container="#players" data-sort-ajax-url="https://www.eliteprospects.com/ajax/team.player-stats?teamId=552&amp;season=2005-2006&amp;position=">
                <thead style="background-color: #fff">
                    <tr style="background-color: #fff">
                        <th class="position">#</th>
                        <th class="player sorted" data-sort="player">Player<i class="fa fa-caret-down"></i></th>
                        <th class="gp" data-sort="gp">GP</th>
                        <th class="g" data-sort="g">G</th>
                        <th class="a" data-sort="a">A</th>
                        <th class="tp" data-sort="tp">TP</th>
                        <th class="pim" data-sort="pim">PIM</th>
                        <th class="pm" data-sort="pm">+/-</th>
                        <th class="separator">&nbsp;</th>
                        <th class="playoffs gp" data-sort="playoffs-gp">GP</th>
                        <th class="playoffs g" data-sort="playoffs-g">G</th>
                        <th class="playoffs a" data-sort="playoffs-a">A</th>
                        <th class="playoffs tp" data-sort="playoffs-tp">TP</th>
                        <th class="playoffs pim" data-sort="playoffs-pim">PIM</th>
                        <th class="playoffs pm" data-sort="playoffs-pm">+/-</th>
                    </tr>
                </thead>
                <tbody>

Any help would be greatly appreciated!


Solution

  • In the source of the page you are trying to scrape, the first tr sits in this table:

        <div class="table-wizard">
            <table class="table table-striped">
                <thead>
                    <tr>
                        <th class="position">#</th>
                        <th class="player">Player</th>
                        <th class="gp">GP</th>
                        <th class="g">G</th>
                        <th class="a">A</th>
                        <th class="sorted tp">TP</th>
                    </tr>
                </thead>
                <tbody>
    

    That is why those six headers are all you get. JavaScript is not altering the markup after the fact, either: running querySelector in the browser console returns the same row:

    > document.querySelector('tr')
    > <tr>
          <th class="position">#</th>
          <th class="player">Player</th>
          <th class="gp">GP</th>
          <th class="g">G</th>
          <th class="a">A</th>
          <th class="sorted tp">TP</th>
      </tr>
    

    In short, Beautiful Soup is returning every th tag in the first tr tag on the page; that row is simply not the one you quoted.

    If you instead grab the second tr that contains th tags, using the CSS selector tr:has(th), you get the full set of headers:

    column_headers = [th.getText() for th in scraper.select('tr:has(th)', limit=2)[1].findAll('th')]
    

    Output (the '\xa0' entry is the &nbsp; separator cell):

    ['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', '\xa0', 'GP', 'G', 'A', 'TP', 'PIM', '+/-']
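
Relying on row order is fragile, so a possible alternative is to select the stats table by its skater-stats class (taken from the markup in the question) and skip the spacer cell. This is only a sketch, not something verified against the live site: SAMPLE_HTML is a hypothetical, trimmed stand-in for the page so the example runs offline, first_row_headers and stats_headers are illustrative helper names, and the :not(.separator) filter assumes the spacer cell keeps its separator class.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: a short summary table first, then the
# full stats table (header cells trimmed from the markup in the question).
SAMPLE_HTML = """
<div class="table-wizard">
  <table class="table table-striped">
    <thead><tr>
      <th class="position">#</th><th class="player">Player</th>
      <th class="gp">GP</th><th class="g">G</th>
      <th class="a">A</th><th class="sorted tp">TP</th>
    </tr></thead>
  </table>
</div>
<table class="table table-striped table-sortable skater-stats highlight-stats">
  <thead><tr>
    <th class="position">#</th>
    <th class="player sorted">Player<i class="fa fa-caret-down"></i></th>
    <th class="gp">GP</th><th class="g">G</th>
    <th class="a">A</th><th class="tp">TP</th>
    <th class="pim">PIM</th><th class="pm">+/-</th>
    <th class="separator">&nbsp;</th>
    <th class="playoffs gp">GP</th><th class="playoffs g">G</th>
    <th class="playoffs a">A</th><th class="playoffs tp">TP</th>
    <th class="playoffs pim">PIM</th><th class="playoffs pm">+/-</th>
  </tr></thead>
</table>
"""


def first_row_headers(soup):
    """Headers of the first tr in document order, as the question's code reads them."""
    first_row = soup.find_all("tr", limit=1)[0]
    return [th.get_text(strip=True) for th in first_row.find_all("th")]


def stats_headers(soup):
    """Headers of the stats table only, skipping the blank separator cell."""
    cells = soup.select("table.skater-stats thead th:not(.separator)")
    return [th.get_text(strip=True) for th in cells]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(first_row_headers(soup))  # the six headers of the summary table
print(stats_headers(soup))      # the fourteen headers of the stats table
```

On this sample, the first call reproduces the six headers from the question, and the second returns the fourteen headers the question expected, without the '\xa0' entry. Against the real page you would build the soup from the urlopen response instead of SAMPLE_HTML.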