Iterative long-to-wide Python one-liner (or two) using groupby


I'm looking to turn a long dataset into a wide one using functional and iterative tools, and my understanding is that this is a task for groupby. I've asked a couple of questions about this before, and thought I had it, but not quite in this case, which ought to be simpler:

Here's the data I have:

from itertools import groupby
from operator import itemgetter
from pprint import pprint

>>> longdat=[
{"id":"cat", "name" : "best meower", "value": 10},
{"id":"cat", "name" : "cleanest paws", "value": 8},
{"id":"cat", "name" : "fanciest", "value": 9},
{"id":"dog", "name" : "smelly", "value": 9},
{"id":"dog", "name" : "dumb", "value": 9},
]

Here's the format I want it in:

>>> widedat=[
{"id":"cat", "best meower": 10, "cleanest paws": 8, "fanciest": 9},
{"id":"dog", "smelly": 9, "dumb": 9},
]

Here are my failed attempts:

# WRONG
>>> gh = groupby(sorted(longdat,key=id),itemgetter('id'))
>>> list(gh)
[('cat', <itertools._grouper object at 0x5d0b550>), ('dog', <itertools._grouper object at 0x5d0b210>)]

OK, need to get the second item out of the iterator, fair enough.

#WRONG
>>> gh = groupby(sorted(longdat,key=id),itemgetter('id'))
>>> for g,v in gh:
...     {"id":i["id"], i["name"]:i["value"] for i in v}
                                      ^
SyntaxError: invalid syntax

Weird, it looked valid. Let's unwind those loops to make sure.

#WRONG
gb = groupby(sorted(longdat,key=id),itemgetter('id'))
data = {}
for g,v in gb:
    data[g] = {}
    for i in v:
        data[g] = i

#WRONG
gb = groupby(sorted(longdat,key=id),itemgetter('id'))
data = []
for g,v in gb:
    for i in v:
        data[g] = i

Ah! OK, let's go back to the one-line form

#WRONG
>>> gb = groupby(sorted(longdat,key=id),itemgetter('id'))
>>> [{"id":g, i["name"]:i["value"]} for i in k for g,k in gb]
[]

What? Why empty?! Let's unwind basically exactly this again:

#WRONG
gb = groupby(sorted(longdat,key=id),itemgetter('id'))
for g,k in gb:
    for i in k:
       print(g, i["name"],i["value"])
cat best meower 10
cat fanciest 9
cat cleanest paws 8
dog smelly 9
dog dumb 9

Now, this last one is obviously the worst: it's clear my data is basically right back where it started, as if I hadn't even called groupby.

Why is this not working and how can I get this in the format I'm seeking?

Also, is it possible to phrase this entirely iteratively such that I could do

>>> result[0]
{"id":"cat", "best meower": 10, "cleanest paws": 8, "fanciest": 9}

and only get the first result without processing the entire list (beyond having to look at /all/ where id == 'cat'?)


Solution

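  • The key function passed to sorted is the built-in id, which returns each object's identity: a different integer for every dict in the list. The rows are therefore sorted by identity rather than by their "id" field, so equal ids are not guaranteed to end up adjacent for groupby.

    It should be itemgetter('id') or lambda x: x['id'].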

    >>> id(longdat[0])
    41859624L
    >>> id(longdat[1])
    41860488L
    >>> id(longdat[2])
    41860200L
    >>> itemgetter('id')(longdat[1])
    'cat'
    >>> itemgetter('id')(longdat[2])
    'cat'
    >>> itemgetter('id')(longdat[3])
    'dog'
    

    from itertools import groupby
    from operator import itemgetter
    
    longdat = [
        {"id":"cat", "name" : "best meower", "value": 10},
        {"id":"cat", "name" : "cleanest paws", "value": 8},
        {"id":"cat", "name" : "fanciest", "value": 9},
        {"id":"dog", "name" : "smelly", "value": 9},
        {"id":"dog", "name" : "dumb", "value": 9},
    ]
    
    getid = itemgetter('id')
    result = [
        dict([['id', key]] + [[d['name'], d['value']] for d in grp])
        for key, grp in groupby(sorted(longdat, key=getid), key=getid)
    ]
    print(result)
    

    output:

    [{'best meower': 10, 'fanciest': 9, 'id': 'cat', 'cleanest paws': 8},
     {'dumb': 9, 'smelly': 9, 'id': 'dog'}]
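
On the last part of the question (getting only the first result without processing everything), one possible approach is a generator instead of a list comprehension. The sketch below is not part of the original answer; the function name widen is hypothetical, and it assumes the rows already arrive with equal ids adjacent, because calling sorted would read the whole input before groupby sees anything.

```python
from itertools import groupby
from operator import itemgetter

def widen(rows, key="id"):
    """Lazily yield one wide dict per run of consecutive rows sharing `key`.

    Assumption: `rows` is already ordered so that equal keys are adjacent.
    """
    for group_key, group in groupby(rows, key=itemgetter(key)):
        wide = {key: group_key}
        # Each long row contributes one name -> value column.
        wide.update((row["name"], row["value"]) for row in group)
        yield wide

longdat = [
    {"id": "cat", "name": "best meower", "value": 10},
    {"id": "cat", "name": "cleanest paws", "value": 8},
    {"id": "cat", "name": "fanciest", "value": 9},
    {"id": "dog", "name": "smelly", "value": 9},
    {"id": "dog", "name": "dumb", "value": 9},
]

# Only the "cat" rows (plus the first "dog" row, which groupby must read
# to notice that the group has ended) are consumed here.
first = next(widen(longdat))
print(first)
```

A generator cannot be indexed with result[0], so next(...) plays that role here. If the data is not already grouped by id, sort it with key=itemgetter('id') first and accept that the sort reads the full input.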