Tags: python, web-scraping, beautifulsoup, python-requests

Why does Beautiful Soup find_all not find all matching elements in the page?


What am I trying to achieve?

I am trying to scrape the "Player Shooting" table from this webpage. More specifically, I want to return the tr tags from the stats_shooting table as a list (one tr per list element).

What have I done so far?

I return the web page using the block below:

import requests
from bs4 import BeautifulSoup as bs

# Request page
all_players_shooting_url = "https://fbref.com/en/comps/9/shooting/Premier-League-Stats"
html = requests.get(all_players_shooting_url)
assert html.status_code == 200, f"Status code of {html.status_code} was returned."
# Parse the response body (html.text), not the Response object itself
soup = bs(html.text, 'html.parser')

Where have I encountered problems, and what have I done to resolve them?

I have then tried a number of approaches to get to the data that I need:

Simple find_all method: this gives me the outer div, but I can't search it further to get the tr elements.

granular_search = soup.find_all("div", {"id": "all_stats_shooting"})
print(f"Granular search returns {len(granular_search)} results. Expected 1.")

Brute force return of all table tags from the page. This doesn't return the table I care about...

broad_search = soup.find_all("table", recursive=True)
print(f"Broad search returns {len(broad_search)} results. Expected 3.")

Some joy returning the outer element using a CSS selector (I actually get something back...), but I am not able to search it further to get the tr elements...

css_search = soup.select("#all_stats_shooting")
print(f"CSS search returns {len(css_search)} results. Expected 1.")
further_search = css_search[0].find_all("tr")
print(f"Further search returns {len(further_search)} results. Expected > 0.")

I can attempt to return all tr elements on the page, but again that only returns the rows of the first two tables...

tr_search = soup.find_all('tr')
print(f"Tr search returns {len(tr_search)} results. Expected > 44")

Please note: I have also developed a solution using Selenium. It works but it's slow and unstable. With this in mind, some of the existing answers e.g. this one don't really solve my problem.


Solution

  • The main issue is that the table you are looking for is wrapped in an HTML comment, so the parser treats it as comment text rather than as elements. Strip the comment markers before parsing:

    soup = bs(html.text.replace('<!--','').replace('-->',''), 'html.parser')
    

    Then, to select only the data rows, adjust your CSS selector:

    soup.select("#all_stats_shooting table tr:has(td)")
    

    To scrape the table and store it directly in a DataFrame, use pandas; see and adapt the related question "How to extract hidden table from fbref website by id?"
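The effect of the comment wrapper can be reproduced offline. The sketch below is only an illustration: SAMPLE_HTML is a made-up, cut-down stand-in for the fbref page (the real markup is larger and may differ), and the two helper functions are hypothetical names. It shows the marker-stripping approach from this answer next to an alternative that re-parses only the comment nodes inside the wrapper div via bs4's Comment class.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: one visible table and one table
# that exists only inside an HTML comment.
SAMPLE_HTML = """
<div id="all_stats_squads">
  <table id="stats_squads"><tr><th>Squad</th></tr><tr><td>Arsenal</td></tr></table>
</div>
<div id="all_stats_shooting">
  <!--
  <table id="stats_shooting">
    <tr><th>Player</th><th>Shots</th></tr>
    <tr><td>Player A</td><td>10</td></tr>
    <tr><td>Player B</td><td>7</td></tr>
  </table>
  -->
</div>
"""


def rows_by_stripping_markers(html: str) -> list:
    """Remove the comment delimiters, then select rows that contain a td."""
    uncommented = html.replace("<!--", "").replace("-->", "")
    soup = BeautifulSoup(uncommented, "html.parser")
    return soup.select("#all_stats_shooting table tr:has(td)")


def rows_by_parsing_comments(html: str) -> list:
    """Leave the page as-is and re-parse only the comments in the wrapper div."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.select_one("#all_stats_shooting")
    rows = []
    for comment in wrapper.find_all(string=lambda text: isinstance(text, Comment)):
        fragment = BeautifulSoup(str(comment), "html.parser")
        rows.extend(fragment.select("table tr:has(td)"))
    return rows


plain = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(len(plain.find_all("table")))                 # the commented table is not counted
print(len(rows_by_stripping_markers(SAMPLE_HTML)))  # data rows found after stripping
print(len(rows_by_parsing_comments(SAMPLE_HTML)))   # same rows via Comment nodes
```

Stripping the markers across the whole document also exposes any markup that was commented out on purpose. Re-parsing only the comments inside the wrapper keeps the change scoped to the element you care about.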