python, html, web-scraping, text-processing, string-parsing

How to work with data from NBA.com?


I found Greg Reda's blog post about scraping HTML from nba.com:

http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/

I tried to work with the code he wrote there:

import requests
import json

url = 'http://stats.nba.com/stats/leaguedashteamshotlocations?Conference=&DateFr' + \
      'om=&DateTo=&DistanceRange=By+Zone&Division=&GameScope=&GameSegment=&LastN' + \
      'Games=0&LeagueID=00&Location=&MeasureType=Opponent&Month=0&OpponentTeamID' + \
      '=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperien' + \
      'ce=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2014-15&SeasonSegment=&Seas' + \
      'onType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision='

response = requests.get(url)
response.raise_for_status()
shots = response.json()['resultSets']['rowSet']

avg_percentage = shots['OPP_FG_PCT']

print(avg_percentage)

But it returns:

Traceback (most recent call last):
  File "C:\Python34\nba.py", line 91, in <module>
    avg_percentage = shots['OPP_FG_PCT']
TypeError: list indices must be integers, not str

I only know basic Python, so I couldn't figure out how to get the values I need out of this data. Can anybody explain?


Solution

  • Evidently the data structure has changed since Greg Reda wrote that post. Before exploring the data, I recommend that you save it to a file via pickling. That way you don't have to keep hitting the NBA server and waiting for a download each time you modify and rerun the script.

    The following script checks for the existence of the pickled data to avoid unnecessary downloading:

    import os
    import pickle

    import requests

    url = 'http://stats.nba.com/stats/leaguedashteamshotlocations?Conference=&DateFr' + \
          'om=&DateTo=&DistanceRange=By+Zone&Division=&GameScope=&GameSegment=&LastN' + \
          'Games=0&LeagueID=00&Location=&MeasureType=Opponent&Month=0&OpponentTeamID' + \
          '=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperien' + \
          'ce=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2014-15&SeasonSegment=&Seas' + \
          'onType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision='

    file_name = 'result_sets.pickled'

    if os.path.isfile(file_name):
        # Reuse the cached data instead of hitting the NBA server again.
        with open(file_name, 'rb') as f:
            result_sets = pickle.load(f)
    else:
        # First run: download the data and cache it to disk for later runs.
        response = requests.get(url)
        response.raise_for_status()
        result_sets = response.json()['resultSets']
        with open(file_name, 'wb') as f:
            pickle.dump(result_sets, f)

    # Quick look at the shape of the data.
    print(result_sets.keys())
    print(result_sets['headers'][1])
    print(result_sets['rowSet'][0])
    print(len(result_sets['rowSet']))
    

    Once you have result_sets in hand, you can examine the data. If you print it, you'll see that it's a dictionary. You can extract the dictionary keys:

    print(result_sets.keys())
    

    Currently the keys are 'headers', 'rowSet', and 'name'. You can inspect the headers:

    print(result_sets['headers'])
    

    I probably know less about these statistics than you do. However, by looking at the data, I've been able to figure out that result_sets['rowSet'] contains 30 rows of 23 elements each. The 23 columns are identified by result_sets['headers'][1]. Try this:

    print(result_sets['headers'][1])
    

    That will show you the 23 column names. Now take a look at the first row of team data:

    print(result_sets['rowSet'][0])
    

    Now you see the 23 values reported for the Atlanta Hawks. You can iterate over the rows in result_sets['rowSet'] to extract whatever values interest you and to compute aggregate information such as totals and averages.
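
    For example, since result_sets['rowSet'] is a plain list of rows rather than a dictionary, it has to be indexed by position; that is also why the original shots['OPP_FG_PCT'] raised the TypeError. Here is a minimal sketch of averaging one column across all 30 teams, assuming (as the output above suggests) that result_sets['headers'][1] is the flat list of 23 column names and that 'OPP_FG_PCT' is among them; adjust the lookup if the structure differs on your end:

    # Sketch: league-wide average of the opponents' field-goal percentage.
    column_names = result_sets['headers'][1]       # assumed: the flat list of 23 column names
    pct_index = column_names.index('OPP_FG_PCT')   # position of the column we want

    values = [row[pct_index] for row in result_sets['rowSet']]
    avg_percentage = sum(values) / len(values)
    print(avg_percentage)

    The same pattern works for any other column: look up its position in the header list once, then read that position out of every row.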