Tags: python, python-2.7, recursion, defaultdict

Function to aggregate a set of data and output a nested dictionary


I have looked all over for a solution to this problem and I can't find anything that works the way I need it to.

I want to create a Python function that takes three arguments:

  1. data_object - a list of dictionaries that all have the same fields: anywhere from 1 to n 'dimension' fields to group by, and anywhere from 1 to n metric fields to be aggregated.
  2. dimensions - the list of dimension fields to group by
  3. metrics - the list of metric fields to aggregate
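To make the three arguments concrete, here is a small sketch of what the inputs and the desired output could look like. The field names come from the example further down; the values are made up purely for illustration.

```python
# Hypothetical sample input: two dimension fields and two metric fields.
data_object = [
    {'year': 2014, 'month': 1, 'sales': 10, 'cost': 4},
    {'year': 2014, 'month': 1, 'sales': 5, 'cost': 1},
    {'year': 2014, 'month': 2, 'sales': 7, 'cost': 3},
]
dimensions = ['year', 'month']
metrics = ['sales', 'cost']

# Desired result: one nesting level per dimension, in the order given,
# with the summed metrics at the innermost level.
expected = {
    2014: {
        1: {'sales': 15, 'cost': 5},
        2: {'sales': 7, 'cost': 3},
    },
}
```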

The way I have solved this problem previously is to use setdefault:

struc = {}
for row in rows:
    year = row['year']
    month = row['month']
    affiliate = row['affiliate']
    website = row['website']
    pgroup = row['product_group']
    sales = row['sales']
    cost = row['cost']
    struc.setdefault(year, {})
    struc[year].setdefault(month, {})
    struc[year][month].setdefault(affiliate, {})
    struc[year][month][affiliate].setdefault(website, {})
    struc[year][month][affiliate][website].setdefault(pgroup, {'sales':0, 'cost':0})
    struc[year][month][affiliate][website][pgroup]['sales'] += sales
    struc[year][month][affiliate][website][pgroup]['cost'] += cost

The problem is that the field names, the number of dimension fields, and the number of metric fields will all be different if I'm looking at a different set of data.

I have seen posts about recursive functions and defaultdict but (unless I misunderstood them) they all either require you to know how many dimension and metric fields you want to work with, or they don't output a dictionary object, which is what I require.
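For reference, the recursive defaultdict pattern those posts describe usually looks something like the sketch below. The names nested, to_dict, and aggregate are hypothetical, and the conversion step at the end is one possible way to get a plain dict back; it is not taken from any particular post.

```python
from collections import defaultdict


def nested():
    # A defaultdict whose missing keys create another defaultdict of the same kind.
    return defaultdict(nested)


def to_dict(node):
    # Recursively turn defaultdicts back into plain dicts; leave other values alone.
    if isinstance(node, defaultdict):
        return dict((key, to_dict(value)) for key, value in node.items())
    return node


def aggregate(rows, dimensions, metrics):
    tree = nested()
    for row in rows:
        node = tree
        for dimension in dimensions:
            # Indexing creates the level on first access.
            node = node[row[dimension]]
        for metric in metrics:
            # .get() does not trigger the default factory, so 0 is used as the start value.
            node[metric] = node.get(metric, 0) + row[metric]
    return to_dict(tree)
```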


Solution

  • It was so much simpler than I thought :)

    My main problem was: if you have n dimensions, how do you reference the correct level of the dictionary while you are looping through the dimensions for each row?

    I solved this by creating a pointer variable and pointing it at the newly made level of the dictionary every time I created a new level:

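    def jsonify(data, dimensions, metrics, struc=None):
        # Create a fresh dict per call. A mutable default ({}) is shared
        # between calls and would carry totals over from earlier calls.
        if struc is None:
            struc = {}
        for row in data:
            # Start again at the top level for every row.
            pointer = struc
            for dimension in dimensions:
                # Create the level if it is missing, then step down into it.
                pointer = pointer.setdefault(row[dimension], {})
            for metric in metrics:
                # pointer now refers to the innermost level for this row.
                pointer.setdefault(metric, 0)
                pointer[metric] += row[metric]
        return struc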
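A short usage sketch of the pointer approach follows. The function is repeated so the snippet runs on its own, and the sample rows are made up for illustration. This copy defaults struc to None and builds the dict inside the function, because a dict used directly as a default value is created once and then shared by every call that omits the argument.

```python
def jsonify(data, dimensions, metrics, struc=None):
    # None as the default means each call gets its own fresh dict.
    if struc is None:
        struc = {}
    for row in data:
        pointer = struc
        for dimension in dimensions:
            # Create the level if needed and move the pointer down into it.
            pointer = pointer.setdefault(row[dimension], {})
        for metric in metrics:
            pointer[metric] = pointer.get(metric, 0) + row[metric]
    return struc


# Hypothetical sample rows.
rows = [
    {'year': 2014, 'month': 1, 'sales': 10, 'cost': 4},
    {'year': 2014, 'month': 1, 'sales': 5, 'cost': 1},
    {'year': 2014, 'month': 2, 'sales': 7, 'cost': 3},
]

by_month = jsonify(rows, ['year', 'month'], ['sales', 'cost'])
# by_month == {2014: {1: {'sales': 15, 'cost': 5}, 2: {'sales': 7, 'cost': 3}}}

# Changing the grouping only means passing a different list of dimensions.
by_year = jsonify(rows, ['year'], ['sales'])
# by_year == {2014: {'sales': 22}}
```

Passing an existing dict as struc still works if you want to keep accumulating into the same structure across several batches of rows.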