Search code examples
pythonjsonmedical

Populating a variably nested dictionary with multiple API calls


I am using a public API at www.gpcontract.co.uk to populate a large variably nested dictionary representing a hierarchy of UK health organisations.

Some background information

The top level of the hierarchy is the four UK countries (England, Scotland, Wales and Northern Ireland), then regional organisations all the way down to individual clinics. The depth of the hierarchy is different for each of the countries and can change depending on the year. Each organisation has a name, orgcode and dictionary listing its child organisations.

Unfortunately, the full nested hierarchy is not available from the API, but calls to http://www.gpcontract.co.uk/api/children/[organisation code]/[year] will return the immediate child organisations of any other.

So that the hierarchy can be easily navigated in my app, I want to generate an offline dictionary of this full hierarchy (on a per year basis) which will be saved using pickle and bundled with the app.

Getting this means a lot of API calls, and I am having trouble converting the returned JSON into the dictionary object I require.

Here is an example of one tiny part of the hierarchy (I have only shown a single child organisation as an example).

JSON hierarchy example

{
  "eng": {
    "name": "England",
    "orgcode": "eng",
    "children": {}
  },
  "sco": {
    "name": "Scotland",
    "orgcode": "sco",
    "children": {}
  },
  "wal": {
    "name": "Wales",
    "orgcode": "wal",
    "children": {}
  },
  "nir": {
    "name": "Northern Ireland",
    "orgcode": "nir",
    "children": {
      "blcg": {
        "name": "Belfast Local Commissioning Group",
        "orgcode": "blcg",
        "children": {
          "abc": {
            "name": "Random Clinic",
            "orgcode": "abc",
            "children": {}
          }
        }
      }
    }
  }
}

Here’s the script I’m using to make the API calls and populate the dictionary:

My script

import json, pickle, urllib.request, urllib.error, urllib.parse

# Organisation hierarchy may vary between years. Set the year here.
year = 2017

# This function returns a list containing a dictionary for each child organisation with keys for name and orgcode
def get_child_orgs(orgcode, year):
    orgcode = str(orgcode)
    year = str(year)

    # Correct 4-digit year to 2-digit
    if len(year) > 2:
        year = year[2:]

    try:
        child_data = json.loads(urllib.request.urlopen('http://www.gpcontract.co.uk/api/children/' + str(orgcode) + '/' + year).read())

        output = []

        if child_data != []:
            for item in child_data['children']:
                output.append({'name' : item['name'], 'orgcode' : str(item['orgcode']).lower(), 'children' : {}})
        return output
    except urllib.error.HTTPError:
        print('HTTP error!')
    except:
        print('Other error!')

# I start with a template of the top level of the hierarchy and then populate it
hierarchy = {'eng' : {'name' : 'England', 'orgcode' : 'eng', 'children' : {}}, 'nir' : {'name' : 'Northern Ireland', 'orgcode' : 'nir', 'children' : {}}, 'sco' : {'name' : 'Scotland', 'orgcode' : 'sco', 'children' : {}}, 'wal' : {'name' : 'Wales', 'orgcode' : 'wal', 'children' : {}}}

print('Loading data...\n')

# Here I use nested for loops to make API calls and populate the dictionary down the levels of the hierarchy. The bottom level contains the most items.
for country in ('eng', 'nir', 'sco', 'wal'): 

    for item1 in get_child_orgs(country, year):
        hierarchy[country]['children'][item1['orgcode']] = item1

        for item2 in get_child_orgs(item1['orgcode'], year):
            hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']] = item2

            # Only England and Wales hierarchies go deeper than this
            if country in ('eng', 'wal'):

                level3 = get_child_orgs(item2['orgcode'], year)
                # Check not empty array
                if level3 != []:
                    for item3 in level3:
                        hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']]['children'][item3['orgcode']] = item3

                        level4 = get_child_orgs(item3['orgcode'], year)
                        # Check not empty array
                        if level4 != []:
                            for item4 in level4:
                                hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']]['children'][item3['orgcode']]['children'][item4['orgcode']] = item4

# Save the completed hierarchy with pickle
file_name = 'hierarchy_' + str(year) + '.dat'
with open(file_name, 'wb') as out_file:
    pickle.dump(hierarchy, out_file)

print('Success!')

The problem

This seems to work most of the time, but it feels hacky and sometimes crashes when a nested for loop returns a "NoneType is not iterable error". I realise this is making a lot of API calls and takes several minutes to run, but I cannot see a way around this, as I want the completed hierarchy available offline for the user to make the data searchable quickly. I will then use the API in a slightly different way to get the actual healthcare data for the chosen organisation.

My question

Is there a cleaner and more flexible way to do this that would accommodate the variable nesting of the organisation hierarchy?

Is there a way to do this significantly more quickly?

I am relatively inexperienced with JSON so any help would be appreciated.


Solution

  • I think this question may be better suited over on the Code Review Stack Exchange, but as you mention that your code sometimes crashes and returns NoneType errors I'll give it the benefit of the doubt.

    Looking at your description, this is what stands out to me

    Each organisation has a name, orgcode and dictionary listing its child organisations. [API calls] will return the immediate child organisations of any other.

    So, what this suggests to me (and how it looks in your sample data) is that all your data is exactly equivalent; the hierarchy only exists due to the nesting of the data and is not enforced by the format of any particular node.

    This, consequently, means that you should be able to have a single piece of code which handles the nesting of an infinitely (or arbitrarily, if you prefer) deep tree. Obviously, you do this for the API call itself (get_child_orgs()), so just replicate that for constructing the tree.

    def populate_hierarchy(organization,year):
        """ Recursively Populate the Organization Hierarchy
    
            organization should be a dict with an "orgcode" key with a string value
            and "children" key with a dict value.
    
            year should be a 2-4 character string representing a year.
        """
        orgcode = organization['orgcode']
    
        ## get_child_orgs returns a list of organizations
        children = get_child_orgs(orgcode,year)
    
        ## get_child_orgs returns None on Errors
        if children:
            for child in children:
    
                ## Add child to the current organization's children, using
                ## orgcode as its key
                organization['children'][child['orgcode']] = child
    
                ## Recursively populate the child's sub-hierarchy
                populate_hierarchy(child,year)
    
        ## Technically, the way this is written, returning organization is
        ## pointless because we're modifying organization in place, but I'm
        ## doing it anyway to explicitly denote the end of the function
        return organization
    
     for country in hierarchy.values():
         populate_hierarchy(country,year)
    

    It's worth noting (since you were checking for empty lists prior to iterating in your original code) that for x in y still functions correctly if y is an empty list, so you don't need to check.

    The NoneType Error likely arises because you catch the Error in get_child_orgs and then implicitly return None. Therefore, for example level3 = get_child_orgs[etc...] results in level3 = None; this leads to if None != []: in the next line being True, and then you try to iterate over None with for item3 in None: which would raise the error. As noted in the code above, this is why I check the truthiness of children.

    As for whether this can be done more quickly, you can try working with the threading/multiprocessing modules. I just don't know how profitable either of those will be for three reasons:

    1. I haven't tried out the API, so I don't know how much time you have to gain from implementing multiple threads/processes
    2. I have seen API's which timeout requests from IP Addresses when you query too quickly/too often (which would make the implementation pointless)
    3. You say you're only running this process once per year, so runtime in perspective of a full year seems pretty insignificant (obviously, unless the current API calls are taking literal days to complete)

    Finally, I would simply question whether pickle is the appropriate method of storing the information, or if you wouldn't just be better off using json.dump/load (for the record, the json module doesn't care if you change the extension to .dat if you're partial to that extension name).