I am using a public API at www.gpcontract.co.uk
to populate a large variably nested dictionary representing a hierarchy of UK health organisations.
Some background information
The top level of the hierarchy is the four UK countries (England, Scotland, Wales and Northern Ireland), then regional organisations all the way down to individual clinics. The depth of the hierarchy is different for each of the countries and can change depending on the year. Each organisation has a name, orgcode and dictionary listing its child organisations.
Unfortunately, the full nested hierarchy is not available from the API, but calls to http://www.gpcontract.co.uk/api/children/[organisation code]/[year]
will return the immediate child organisations of any other.
So that the hierarchy can be easily navigated in my app, I want to generate an offline dictionary of this full hierarchy (on a per year basis) which will be saved using pickle
and bundled with the app.
Getting this means a lot of API calls, and I am having trouble converting the returned JSON into the dictionary object I require.
Here is an example of one tiny part of the hierarchy (I have only shown a single child organisation as an example).
JSON hierarchy example
{
"eng": {
"name": "England",
"orgcode": "eng",
"children": {}
},
"sco": {
"name": "Scotland",
"orgcode": "sco",
"children": {}
},
"wal": {
"name": "Wales",
"orgcode": "wal",
"children": {}
},
"nir": {
"name": "Northern Ireland",
"orgcode": "nir",
"children": {
"blcg": {
"name": "Belfast Local Commissioning Group",
"orgcode": "blcg",
"children": {
"abc": {
"name": "Random Clinic",
"orgcode": "abc",
"children": {}
}
}
}
}
}
}
Here’s the script I’m using to make the API calls and populate the dictionary:
My script
import json, pickle, urllib.request, urllib.error, urllib.parse
# Organisation hierarchy may vary between years. Set the year here.
year = 2017
# This function returns a list containing a dictionary for each child organisation with keys for name and orgcode
def get_child_orgs(orgcode, year):
orgcode = str(orgcode)
year = str(year)
# Correct 4-digit year to 2-digit
if len(year) > 2:
year = year[2:]
try:
child_data = json.loads(urllib.request.urlopen('http://www.gpcontract.co.uk/api/children/' + str(orgcode) + '/' + year).read())
output = []
if child_data != []:
for item in child_data['children']:
output.append({'name' : item['name'], 'orgcode' : str(item['orgcode']).lower(), 'children' : {}})
return output
except urllib.error.HTTPError:
print('HTTP error!')
except:
print('Other error!')
# I start with a template of the top level of the hierarchy and then populate it
hierarchy = {'eng' : {'name' : 'England', 'orgcode' : 'eng', 'children' : {}}, 'nir' : {'name' : 'Northern Ireland', 'orgcode' : 'nir', 'children' : {}}, 'sco' : {'name' : 'Scotland', 'orgcode' : 'sco', 'children' : {}}, 'wal' : {'name' : 'Wales', 'orgcode' : 'wal', 'children' : {}}}
print('Loading data...\n')
# Here I use nested for loops to make API calls and populate the dictionary down the levels of the hierarchy. The bottom level contains the most items.
for country in ('eng', 'nir', 'sco', 'wal'):
for item1 in get_child_orgs(country, year):
hierarchy[country]['children'][item1['orgcode']] = item1
for item2 in get_child_orgs(item1['orgcode'], year):
hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']] = item2
# Only England and Wales hierarchies go deeper than this
if country in ('eng', 'wal'):
level3 = get_child_orgs(item2['orgcode'], year)
# Check not empty array
if level3 != []:
for item3 in level3:
hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']]['children'][item3['orgcode']] = item3
level4 = get_child_orgs(item3['orgcode'], year)
# Check not empty array
if level4 != []:
for item4 in level4:
hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']]['children'][item3['orgcode']]['children'][item4['orgcode']] = item4
# Save the completed hierarchy with pickle
file_name = 'hierarchy_' + str(year) + '.dat'
with open(file_name, 'wb') as out_file:
pickle.dump(hierarchy, out_file)
print('Success!')
The problem
This seems to work most of the time, but it feels hacky and sometimes crashes when a nested for loop returns a "NoneType is not iterable error". I realise this is making a lot of API calls and takes several minutes to run, but I cannot see a way around this, as I want the completed hierarchy available offline for the user to make the data searchable quickly. I will then use the API in a slightly different way to get the actual healthcare data for the chosen organisation.
My question
Is there a cleaner and more flexible way to do this that would accommodate the variable nesting of the organisation hierarchy?
Is there a way to do this significantly more quickly?
I am relatively inexperienced with JSON so any help would be appreciated.
I think this question may be better suited over on the Code Review Stack Exchange, but as you mention that your code sometimes crashes and returns NoneType
errors I'll give it the benefit of the doubt.
Looking at your description, this is what stands out to me
Each organisation has a name, orgcode and dictionary listing its child organisations. [API calls] will return the immediate child organisations of any other.
So, what this suggests to me (and how it looks in your sample data) is that all your data is exactly equivalent; the hierarchy only exists due to the nesting of the data and is not enforced by the format of any particular node.
This, consequently, means that you should be able to have a single piece of code which handles the nesting of an infinitely (or arbitrarily, if you prefer) deep tree. Obviously, you do this for the API call itself (get_child_orgs()
), so just replicate that for constructing the tree.
def populate_hierarchy(organization,year):
""" Recursively Populate the Organization Hierarchy
organization should be a dict with an "orgcode" key with a string value
and "children" key with a dict value.
year should be a 2-4 character string representing a year.
"""
orgcode = organization['orgcode']
## get_child_orgs returns a list of organizations
children = get_child_orgs(orgcode,year)
## get_child_orgs returns None on Errors
if children:
for child in children:
## Add child to the current organization's children, using
## orgcode as its key
organization['children'][child['orgcode']] = child
## Recursively populate the child's sub-hierarchy
populate_hierarchy(child,year)
## Technically, the way this is written, returning organization is
## pointless because we're modifying organization in place, but I'm
## doing it anyway to explicitly denote the end of the function
return organization
for country in hierarchy.values():
populate_hierarchy(country,year)
It's worth noting (since you were checking for empty lists prior to iterating in your original code) that for x in y
still functions correctly if y
is an empty list, so you don't need to check.
The NoneType
Error likely arises because you catch the Error in get_child_orgs
and then implicitly return None
. Therefore, for example level3 = get_child_orgs[etc...]
results in level3 = None
; this leads to if None != []:
in the next line being True, and then you try to iterate over None
with for item3 in None:
which would raise the error. As noted in the code above, this is why I check the truthiness of children
.
As for whether this can be done more quickly, you can try working with the threading/multiprocessing
modules. I just don't know how profitable either of those will be for three reasons:
Finally, I would simply question whether pickle
is the appropriate method of storing the information, or if you wouldn't just be better off using json.dump/load
(for the record, the json
module doesn't care if you change the extension to .dat
if you're partial to that extension name).