python beautifulsoup screen-scraping urllib

Python web scraping - how to get resources with beautiful soup when page loads contents via JS?


I am trying to scrape a table from a specific website using BeautifulSoup and urllib. My goal is to create a single list from all the data in this table. The same code works fine on tables from other websites, but on this site the table lookup returns None, so the next line fails with a NoneType error. Can someone help me with this? I've looked for other answers online without much luck.

Here's the code:

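import urllib.request

from bs4 import BeautifulSoup

# Fetch the page and parse the HTML the server returns (no JavaScript is executed).
url = "http://www.teamrankings.com/ncaa-basketball/stat/free-throw-pct"
soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")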

table = soup.find("table", attrs={'class':'sortable'})

data = []
rows = table.findAll("tr")
for tr in rows:
    cols = tr.findAll("td")
    for td in cols:
        text = ''.join(td.find(text=True))
        data.append(text)

print(data)

Solution

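  • It looks like this data is loaded via an AJAX call after the page loads, so the table is not in the HTML that urllib downloads and soup.find returns None.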

    Request the URL that the AJAX call uses instead: http://www.teamrankings.com/ajax/league/v3/stats_controller.php

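    import urllib.parse
    import urllib.request

    from bs4 import BeautifulSoup

    # Form fields copied from the POST request shown in the browser's web inspector.
    params = {
        "type": "team-detail",
        "league": "ncb",
        "stat_id": "3083",
        "season_id": "312",
        "cat_type": "2",
        "view": "stats_v1",
        "is_previous": "0",
        "date": "04/06/2015",
    }

    url = "http://www.teamrankings.com/ajax/league/v3/stats_controller.php"
    # Passing data= makes urlopen send a POST request with a form-encoded body.
    body = urllib.parse.urlencode(params).encode("utf8")
    content = urllib.request.urlopen(url, data=body).read()
    soup = BeautifulSoup(content, "html.parser")

    table = soup.find("table", attrs={"class": "sortable"})

    # Flatten every cell of every row into a single list of strings.
    # get_text() also copes with empty cells, where find(text=True) returns None.
    data = []
    for tr in table.find_all("tr"):
        for td in tr.find_all("td"):
            data.append(td.get_text(strip=True))

    print(data)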
    

    Your browser's web inspector also shows the parameters that are passed along with the POST request.

    Generally the server on the other end checks for these values and rejects your request if any of them are missing. The above code snippet ran fine for me. I used urllib.request (the Python 3 successor to urllib2) because I generally prefer that library.

    If the data loads in your browser it is possible to scrape it. You just need to mimic the request your browser sends.
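
As a rough sketch of what "mimicking the browser" can look like, the snippet below builds the same POST request with the standard library and lets you inspect it before sending anything. The URL and form fields are the ones from the answer above. The `build_request` helper and the header values (`User-Agent`, `X-Requested-With`, `Referer`) are hypothetical additions for illustration; nothing here confirms that this site requires them, so compare them against what your own web inspector shows.

```python
import urllib.parse
import urllib.request

# URL and form fields copied from the answer above.
AJAX_URL = "http://www.teamrankings.com/ajax/league/v3/stats_controller.php"

PARAMS = {
    "type": "team-detail",
    "league": "ncb",
    "stat_id": "3083",
    "season_id": "312",
    "cat_type": "2",
    "view": "stats_v1",
    "is_previous": "0",
    "date": "04/06/2015",
}


def build_request(url, params, referer=None):
    """Build (but do not send) a POST request shaped like the browser's.

    Hypothetical helper: the header values below are assumptions, not
    something the site is known to check.
    """
    # Form-encode the fields; a non-None data argument makes this a POST.
    body = urllib.parse.urlencode(params).encode("utf-8")
    headers = {
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",
    }
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, data=body, headers=headers)


request = build_request(
    AJAX_URL,
    PARAMS,
    referer="http://www.teamrankings.com/ncaa-basketball/stat/free-throw-pct",
)

# Inspect the request and compare it with the one in the web inspector.
print(request.get_method())
print(request.data.decode("utf-8"))

# To actually send it:
# html = urllib.request.urlopen(request).read()
```

Building the request separately from sending it makes it easy to compare the method, body, and headers with the browser's request field by field, and to drop headers one at a time to see which ones the server really needs.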