Search code examples
pandasnumpymathgroup-bycombinations

How to groupby set of columns also grouping by a column with combinations of its items and aggregate on the value column?


Here is the dataframe Input df

df = pd.DataFrame({'county':['Laramie']*10 + ['Albany']*12,
                   'co': ['LU']*22,
                    'tech':['cable']*6+['copper']*4+['cable']*6+['copper']*4+['Fiber']*2,
                    'loc':[*'abcdefdefgmnopqrnostow']})

I want to groupby on county, co, and all the combinations of the items from tech column and aggregate on loc column to get the unique and nunique.

Here is the result I am looking for: Output df

Tried this:

df = df.groupby(['county', 'co'], as_index=True).agg({'tech':'unique', 'loc':'unique', 'loc':'nunique'}).reset_index()

But this doesn't give out all the possible combinations of the tech column.


Solution

  • This looks like a partial powerset:

    from itertools import chain, combinations
    
    # modified powerset recipe
    def powerset(iterable, min=0):
        s = list(iterable)
        return chain.from_iterable(combinations(s, r) for r in range(min, len(s)+1))
    
    group = ['county', 'co']
    
    tmp = df.groupby(['tech', *group])['loc'].agg(set)
    
    out = []
    for k, g in tmp.groupby(group):
        for x in map(list, powerset(g.index.get_level_values('tech'), min=2)):
            out.append(dict(zip(group, k))
                      |{'tech': ', '.join(x),
                        'log': (s:=set().union(*g.loc[x])),
                        'log-nunique': len(s),
                       })
    
    out = pd.DataFrame(out)
    

    NB. this uses the walrus operator (:=, python ≥3.8) and dictionary union ({}|{}, python ≥3.9), none of which are strictly required, the code can be adapted for older python versions*.

    Output:

        county  co                  tech                          log  log-nunique
    0   Albany  LU          Fiber, cable        {o, n, m, r, q, w, p}            7
    1   Albany  LU         Fiber, copper              {o, n, s, t, w}            5
    2   Albany  LU         cable, copper     {o, n, m, s, r, q, t, p}            8
    3   Albany  LU  Fiber, cable, copper  {o, n, m, s, r, q, t, w, p}            9
    4  Laramie  LU         cable, copper        {b, a, c, e, f, g, d}            7
    

    * code for older python:

    group = ['county', 'co']
    
    tmp = df.groupby(['tech', 'county', 'co'])['loc'].agg(set)
    
    out = []
    for k, g in tmp.groupby(['county', 'co']):
        for x in map(list, powerset(g.index.get_level_values('tech'), min=2)):
            out.append({'county': k[0],
                        'co': k[1],
                        'tech': ', '.join(x),
                        'log': set().union(*g.loc[x]),
                       })
            out[-1]['log-nunique'] = len(out[-1]['log'])
    
    out = pd.DataFrame(out)