Here is the dataframe
df = pd.DataFrame({'county':['Laramie']*10 + ['Albany']*12,
'co': ['LU']*22,
'tech':['cable']*6+['copper']*4+['cable']*6+['copper']*4+['Fiber']*2,
'loc':[*'abcdefdefgmnopqrnostow']})
I want to groupby on county, co, and all the combinations of the items from tech column and aggregate on loc column to get the unique and nunique.
Here is the result I am looking for:
Tried this:
df = df.groupby(['county', 'co'], as_index=True).agg({'tech':'unique', 'loc':'unique', 'loc':'nunique'}).reset_index()
But this doesn't give out all the possible combinations of the tech column.
This looks like a partial powerset
:
from itertools import chain, combinations
# modified powerset recipe
def powerset(iterable, min=0):
s = list(iterable)
return chain.from_iterable(combinations(s, r) for r in range(min, len(s)+1))
group = ['county', 'co']
tmp = df.groupby(['tech', *group])['loc'].agg(set)
out = []
for k, g in tmp.groupby(group):
for x in map(list, powerset(g.index.get_level_values('tech'), min=2)):
out.append(dict(zip(group, k))
|{'tech': ', '.join(x),
'log': (s:=set().union(*g.loc[x])),
'log-nunique': len(s),
})
out = pd.DataFrame(out)
NB. this uses the walrus operator (:=
, python ≥3.8) and dictionary union ({}|{}
, python ≥3.9), none of which are strictly required, the code can be adapted for older python versions*.
Output:
county co tech log log-nunique
0 Albany LU Fiber, cable {o, n, m, r, q, w, p} 7
1 Albany LU Fiber, copper {o, n, s, t, w} 5
2 Albany LU cable, copper {o, n, m, s, r, q, t, p} 8
3 Albany LU Fiber, cable, copper {o, n, m, s, r, q, t, w, p} 9
4 Laramie LU cable, copper {b, a, c, e, f, g, d} 7
* code for older python:
group = ['county', 'co']
tmp = df.groupby(['tech', 'county', 'co'])['loc'].agg(set)
out = []
for k, g in tmp.groupby(['county', 'co']):
for x in map(list, powerset(g.index.get_level_values('tech'), min=2)):
out.append({'county': k[0],
'co': k[1],
'tech': ', '.join(x),
'log': set().union(*g.loc[x]),
})
out[-1]['log-nunique'] = len(out[-1]['log'])
out = pd.DataFrame(out)