I'm currently trying to scrape a statistics site in Python 3.7 using BeautifulSoup. I'm trying to grab all of the headers from a table as my column headers, but for some reason BeautifulSoup isn't grabbing all of the headers that are located within the 'th' tags.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.eliteprospects.com/team/552/guelph-storm/2005-2006?tab=stats'
html = urlopen(url)
scraper = BeautifulSoup(html,'html.parser')
column_headers = [th.getText() for th in scraper.findAll('tr', limit=1)[0].findAll('th')] # Find Column Headers.
print(column_headers)
Here is the output I am getting: ['#', 'Player', 'GP', 'G', 'A', 'TP']
Here is the output I should be getting: ['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', 'GP', 'G', 'A', 'TP', 'PIM', '+/-']
For reference, here is what the table's source HTML looks like:
<table class="table table-striped table-sortable skater-stats highlight-stats" data-sort-url="https://www.eliteprospects.com/team/552/guelph-storm/2005-2006?tab=stats" data-sort-ajax-container="#players" data-sort-ajax-url="https://www.eliteprospects.com/ajax/team.player-stats?teamId=552&season=2005-2006&position=">
<thead style="background-color: #fff">
<tr style="background-color: #fff">
<th class="position">#</th>
<th class="player sorted" data-sort="player">Player<i class="fa fa-caret-down"></i></th>
<th class="gp" data-sort="gp">GP</th>
<th class="g" data-sort="g">G</th>
<th class="a" data-sort="a">A</th>
<th class="tp" data-sort="tp">TP</th>
<th class="pim" data-sort="pim">PIM</th>
<th class="pm" data-sort="pm">+/-</th>
<th class="separator"> </th>
<th class="playoffs gp" data-sort="playoffs-gp">GP</th>
<th class="playoffs g" data-sort="playoffs-g">G</th>
<th class="playoffs a" data-sort="playoffs-a">A</th>
<th class="playoffs tp" data-sort="playoffs-tp">TP</th>
<th class="playoffs pim" data-sort="playoffs-pim">PIM</th>
<th class="playoffs pm" data-sort="playoffs-pm">+/-</th>
</tr>
</thead>
<tbody>
Any help would be greatly appreciated!
Looking at the source of the page you are trying to scrape, this is exactly what the first table row in the markup looks like:
<div class="table-wizard">
<table class="table table-striped">
<thead>
<tr>
<th class="position">#</th>
<th class="player">Player</th>
<th class="gp">GP</th>
<th class="g">G</th>
<th class="a">A</th>
<th class="sorted tp">TP</th>
</tr>
</thead>
<tbody>
That is why those are the only headers you get. It's not even a case of JavaScript altering the markup after the fact: if I run querySelector in the browser console, I get the same thing:
> document.querySelector('tr')
> <tr>
<th class="position">#</th>
<th class="player">Player</th>
<th class="gp">GP</th>
<th class="g">G</th>
<th class="a">A</th>
<th class="sorted tp">TP</th>
</tr>
In short, Beautiful Soup is giving you exactly the th tags found in the first tr tag.
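To make that concrete, here is a minimal sketch on a hypothetical, trimmed-down stand-in for the page (SAMPLE_HTML is invented for illustration, not the real markup). It assumes only that a smaller table appears before the stats table, as described above, and shows that find_all with limit=1 returns the first tr in document order, regardless of which table it belongs to.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: a small summary table
# followed by the fuller stats table.
SAMPLE_HTML = """
<div class="table-wizard">
  <table class="table table-striped">
    <thead><tr><th>#</th><th>Player</th><th>TP</th></tr></thead>
  </table>
</div>
<table class="table skater-stats">
  <thead><tr><th>#</th><th>Player</th><th>TP</th><th>PIM</th><th>+/-</th></tr></thead>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all walks the document in order, so limit=1 yields the first <tr>
# anywhere in the page, not the first <tr> of the table you had in mind.
first_row = soup.find_all("tr", limit=1)[0]
first_headers = [th.get_text() for th in first_row.find_all("th")]

# Counting header cells per table makes the mismatch visible.
header_counts = [len(table.find_all("th")) for table in soup.find_all("table")]

print(first_headers)
print(header_counts)
```

Printing the per-table header counts like this is a quick way to check whether the row you grabbed belongs to the table you expected.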
If you try to grab the second tr tag that has th tags, using the CSS selector tr:has(th), you will see you get more th tags:
column_headers = [th.getText() for th in scraper.select('tr:has(th)', limit=2)[1].findAll('th')]
Output
['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', '\xa0', 'GP', 'G', 'A', 'TP', 'PIM', '+/-']
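Picking the row by position can break if the page layout changes. As a possible alternative, the sketch below targets the table by its class and skips the separator cell, which is where the '\xa0' entry comes from. It assumes the skater-stats and separator class names shown in the question's markup are present in what Beautiful Soup receives. SAMPLE_HTML and the stats_headers helper are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the markup quoted in the question.
SAMPLE_HTML = """
<table class="table table-striped">
  <thead><tr><th class="position">#</th><th class="player">Player</th></tr></thead>
</table>
<table class="table table-striped table-sortable skater-stats highlight-stats">
  <thead>
    <tr>
      <th class="position">#</th>
      <th class="player sorted">Player<i class="fa fa-caret-down"></i></th>
      <th class="pm">+/-</th>
      <th class="separator">&nbsp;</th>
      <th class="playoffs gp">GP</th>
    </tr>
  </thead>
</table>
"""


def stats_headers(soup):
    """Return header labels from the skater-stats table, minus the separator."""
    # Assumption: the stats table carries the 'skater-stats' class.
    cells = soup.select("table.skater-stats thead th")
    return [
        th.get_text(strip=True)
        for th in cells
        # The separator column holds only a non-breaking space, so drop it.
        if "separator" not in th.get("class", [])
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
headers = stats_headers(soup)
print(headers)
```

Selecting by class ties the code to the table's identity, not its position, so an extra table added earlier in the page would not change the result.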