python, html, web-scraping, text-processing, string-parsing

How to work with data from NBA.com?


I found Greg Reda's blog post about scraping HTML from nba.com:

http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/

I tried to work with the code he wrote there:

import requests
import json

url = 'http://stats.nba.com/stats/leaguedashteamshotlocations?Conference=&DateFr' + \
      'om=&DateTo=&DistanceRange=By+Zone&Division=&GameScope=&GameSegment=&LastN' + \
      'Games=0&LeagueID=00&Location=&MeasureType=Opponent&Month=0&OpponentTeamID' + \
      '=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperien' + \
      'ce=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2014-15&SeasonSegment=&Seas' + \
      'onType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision='

response = requests.get(url)
response.raise_for_status()
shots = response.json()['resultSets']['rowSet']

avg_percentage = shots['OPP_FG_PCT']

print(avg_percentage)

But it returns:

Traceback (most recent call last):
  File "C:\Python34\nba.py", line 91, in <module>
    avg_percentage = shots['OPP_FG_PCT']
TypeError: list indices must be integers, not str

I only know basic Python, so I couldn't figure out how to get the values I need out of this data. Can anybody explain?


Solution

  • Evidently the data structure has changed since Greg Reda wrote that post. Before exploring the data, I recommend that you save it to a file via pickling. That way you don't have to keep hitting the NBA server and waiting for a download each time you modify and rerun the script.

    The following script checks for the existence of the pickled data to avoid unnecessary downloading:

    import os
    import pickle

    import requests

    url = 'http://stats.nba.com/stats/leaguedashteamshotlocations?Conference=&DateFr' + \
          'om=&DateTo=&DistanceRange=By+Zone&Division=&GameScope=&GameSegment=&LastN' + \
          'Games=0&LeagueID=00&Location=&MeasureType=Opponent&Month=0&OpponentTeamID' + \
          '=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperien' + \
          'ce=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2014-15&SeasonSegment=&Seas' + \
          'onType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision='

    file_name = 'result_sets.pickled'

    if os.path.isfile(file_name):
        # Reuse the cached data instead of hitting the NBA server again.
        with open(file_name, 'rb') as f:
            result_sets = pickle.load(f)
    else:
        # First run: download the data and cache it to disk for later runs.
        response = requests.get(url)
        response.raise_for_status()
        result_sets = response.json()['resultSets']
        with open(file_name, 'wb') as f:
            pickle.dump(result_sets, f)

    # Quick look at the shape of the data.
    print(result_sets.keys())
    print(result_sets['headers'][1])
    print(result_sets['rowSet'][0])
    print(len(result_sets['rowSet']))
    

    Once you have result_sets in hand, you can examine the data. If you print it, you'll see that it's a dictionary. You can extract the dictionary keys:

    print(result_sets.keys())
    

    Currently the keys are 'headers', 'rowSet', and 'name'. You can inspect the headers:

    print(result_sets['headers'])
    

    I probably know less about these statistics than you do. However, by looking at the data, I've been able to figure out that result_sets['rowSet'] contains 30 rows of 23 elements each. The 23 columns are identified by result_sets['headers'][1]. Try this:

    print(result_sets['headers'][1])
    

    That will show you the 23 column names. Now take a look at the first row of team data:

    print(result_sets['rowSet'][0])
    

    Now you see the 23 values reported for the Atlanta Hawks. You can iterate over the rows in result_sets['rowSet'] to extract whatever values interest you and to compute aggregate information such as totals and averages.
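
    For example, since result_sets['rowSet'] is a plain list of rows rather than a dictionary, it has to be indexed by position; that is also why the original shots['OPP_FG_PCT'] raised the TypeError. Here is a minimal sketch of averaging one column across all 30 teams, assuming (as the output above suggests) that result_sets['headers'][1] is the flat list of 23 column names and that 'OPP_FG_PCT' is among them; adjust the lookup if the structure differs on your end:

    # Sketch: league-wide average of the opponents' field-goal percentage.
    column_names = result_sets['headers'][1]       # assumed: the flat list of 23 column names
    pct_index = column_names.index('OPP_FG_PCT')   # position of the column we want

    values = [row[pct_index] for row in result_sets['rowSet']]
    avg_percentage = sum(values) / len(values)
    print(avg_percentage)

    The same pattern works for any other column: look up its position in the header list once, then read that position out of every row.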