Search code examples
pythonpandasdataframegroup-by

Pandas groupby with tag-style list


I have a dataset with 'tag-like' groupings:

     Id       tags
0    item1    ['friends','family']
1    item2    ['friends']
2    item3    []
3    item4    ['family','holiday']

So a row can belong to several groups. I want to create an object similar to groupby, so that I can use agg etc.

df.groupby('tags').count()

expected result

     tags          count
0    'friends'    2
1    'family'     2
2    'holiday'    1

But of course it won't work because it treats the whole list as the key, rather than the individual tags. Here's an attempt

tagset = set(df.tags.explode())
grpby  = { t: df.loc[df.tags.str.contains(t, regex=False)] 
           for t in tagset }

From what I understand, groupby objects are structured a bit like this. But how to make it a groupby object? So that I can do things like grpby.year.mean() etc?


Solution

  • You can't have a row belong to multiple groups like your grpby object. Thus what you want to do is impossible in pure pandas, unless you duplicate the rows with explode, then you will be able to groupby.agg:

    out = (df.explode('tags')
             .groupby('tags', as_index=False)
             .agg(**{'count': ('tags', 'size')})
          )
    

    Output:

          tags  count
    0   family      2
    1  friends      2
    2  holiday      1
    

    With a more meaningful aggregation:

    out = (df.explode('tags')
             .groupby('tags', as_index=False)
             .agg({'Id': frozenset})
          )
    

    Output:

          tags              Id
    0   family  (item4, item1)
    1  friends  (item2, item1)
    2  holiday         (item4)
    

    Note however that explode is quite expensive, so if you just want to count the tags, better use pure python:

    from collections import Counter
    from itertools import chain
    
    out = Counter(chain.from_iterable(df['tags']))
    

    Output:

    Counter({'friends': 2, 'family': 2, 'holiday': 1})
    

    And if you want to split the DataFrame like your grpby object:

    tmp = df.assign(group=df['tags']).explode('group')
    group = tmp.pop('group')
    
    out = dict(list(tmp.groupby(group)))
    

    Output:

    {'family':       Id               tags
               0  item1  [friends, family]
               3  item4  [family, holiday],
     'friends':       Id               tags
               0  item1  [friends, family]
               1  item2          [friends],
     'holiday':       Id               tags
               3  item4  [family, holiday]}