I have this:
entity_key = 'pid'
data = [ { ... }, { ... } ]
entities = list(set([ row[entity_key] for row in data ]))
parsed = []
total_keys = ['a','b','c']
for entity_id in entities:
    entity_rows = [ row for row in data if row[entity_key] == entity_id ]
    totals = { key: sum(filter(None, [ row.get(key) for row in entity_rows ])) for key in total_keys }
    totals[entity_key] = entity_id
    parsed.append(totals)
return parsed
In my scenario, data is about 30,000 items, so it's fairly large. Each item is a dict, each dict contains the identifier pid and a numeric value for each key defined in total_keys, e.g. { 'pid': 5011, 'a': 3, 'b': 20, 'c': 33 }.
As you can see, the code returns one row per unique pid, with the columns defined in the total_keys list summed. There may be about 800-1000 unique pid values, so parsed ends up being about 800-1000 items.
It's slow, presumably because the list comprehension inside the loop rescans all 30,000 rows for every pid. I've tried to rewrite this using itertools.groupby, but it doesn't seem the best fit. Is there some magic I'm missing?
Make a single dict using the pids as the outer keys:
entity_key = 'pid'
data = [{ 'pid': 5011, 'a': 3, 'b': 20, 'c': 33 },
        { 'pid': 5012, 'a': 3, 'b': 20, 'c': 33 },
        { 'pid': 5011, 'a': 3, 'b': 20, 'c': 33 },
        { 'pid': 5012, 'a': 3, 'b': 20, 'c': 33 }]
from collections import defaultdict
totals = ["a", "b", "c"]
dfd = defaultdict(lambda: {"a": 0, "b": 0, "c": 0})
for d in data:
    # set intersection keeps only the keys we want to sum
    for k in d.keys() & totals:
        dfd[d["pid"]][k] += d[k]
Output will be all your pid's grouped and any a b or c key values summed:
defaultdict(<function <lambda> at 0x7f2cf93ed2f0>,
{5011: {'a': 6, 'c': 66, 'b': 40}, 5012: {'a': 6, 'c': 66, 'b': 40}})
For Python 2 you need to use d.viewkeys() & totals in place of d.keys() & totals.
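If the rest of your code still expects the list-of-dicts shape of parsed, the grouped dict converts back in a single pass. A minimal runnable sketch, reusing the sample data from above:

```python
from collections import defaultdict

entity_key = 'pid'
total_keys = ['a', 'b', 'c']
data = [{'pid': 5011, 'a': 3, 'b': 20, 'c': 33},
        {'pid': 5012, 'a': 3, 'b': 20, 'c': 33},
        {'pid': 5011, 'a': 3, 'b': 20, 'c': 33},
        {'pid': 5012, 'a': 3, 'b': 20, 'c': 33}]

# one pass over data: group and sum at the same time
dfd = defaultdict(lambda: dict.fromkeys(total_keys, 0))
for row in data:
    for k in row.keys() & total_keys:
        dfd[row[entity_key]][k] += row[k]

# rebuild the original list-of-dicts shape, one dict per unique pid
parsed = [dict(totals, **{entity_key: pid}) for pid, totals in dfd.items()]
```

This keeps the whole thing O(n) instead of rescanning data once per pid.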
If your data were actually grouped by pid, you could yield a group at a time:
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def yield_d(data, key, keys):
    # groupby only merges adjacent rows, so data must already be sorted by `key`
    for pid, grp in groupby(data, key=itemgetter(key)):
        d = defaultdict(lambda: dict.fromkeys(keys, 0))
        for dct in grp:
            for k in dct.keys() & keys:
                d[pid][k] += dct[k]
        yield d
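A hypothetical usage sketch with made-up sample data (the generator is repeated here so the snippet runs standalone; note the data is sorted by pid first, since groupby only merges adjacent rows):

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def yield_d(data, key, keys):
    # one group per run of equal pids; data must already be sorted by `key`
    for pid, grp in groupby(data, key=itemgetter(key)):
        d = defaultdict(lambda: dict.fromkeys(keys, 0))
        for dct in grp:
            for k in dct.keys() & keys:
                d[pid][k] += dct[k]
        yield d

data = sorted([{'pid': 5011, 'a': 3, 'b': 20, 'c': 33},
               {'pid': 5012, 'a': 3, 'b': 20, 'c': 33},
               {'pid': 5011, 'a': 3, 'b': 20, 'c': 33}],
              key=itemgetter('pid'))

for group in yield_d(data, 'pid', ['a', 'b', 'c']):
    print(dict(group))
# → {5011: {'a': 6, 'b': 40, 'c': 66}}
# → {5012: {'a': 3, 'b': 20, 'c': 33}}
```

Because it yields one group at a time, you never hold more than one pid's totals in memory, which matters if the 30,000-row input grows.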