Search code examples
pythonpandasgroup-by

enumerate items in each group


I have a DataFrame like

    id   chi  prop   ord 
0   100   L    67     0 
1   100   L    68     1 
2   100   L    68     2 
3   100   L    68     3 
4   100   L    70     0 
5   100   L    71     0 
6   100   R    67     0 
7   100   R    68     1 
8   100   R    68     2 
9   100   R    68     3 
10  110   R    70     0 
11  110   R    71     0 
12  101   L    67     0 
13  101   L    68     0 
14  101   L    69     0 
15  101   L    71     0 
16  101   L    72     0 
17  201   R    67     0 
18  201   R    68     0 
19  201   R    69     0

ord essentially gives the ordering of the entries when (prop, chi and id) all have the same value. This isn't quite what I'd like though. Instead, I'd like to be able to enumerate the entries of each group g in {(id, chi)} from 0 to n_g where n_g is the size of group g. So I'd like to obtain something that looks like

    id   chi  prop   count 
0   100   L    67     0 
1   100   L    68     1 
2   100   L    68     2 
3   100   L    68     3 
4   100   L    70     4 
5   100   L    71     5 
6   100   R    67     0 
7   100   R    68     1 
8   100   R    68     2 
9   100   R    68     3 
10  110   R    70     0 
11  110   R    71     1 
12  101   L    67     0 
13  101   L    68     1 
14  101   L    69     2 
15  101   L    71     3 
16  101   L    72     4 
17  201   R    67     0 
18  201   R    68     1 
19  201   R    69     2

I'd like to know if there's a simple way of doing this with pandas. The following comes very close, but it feels way too complicated, and it for some reason won't let me join the resulting dataframe with the original one.

(df.groupby(['id', 'chi'])
   .apply(lambda g: np.arange(g.shape[0]))
   .apply(pd.Series, 1)
   .stack()
   .rename('counter')
   .reset_index()         
   .drop(columns=['level_2']))

EDIT: A second way of course is the for loop way, but I'm looking for something more "Pythonic" than:

for gname, idx in df.groupby(['id','chi']).groups.items():
    tmp = df.loc[idx]
    df.loc[idx, 'counter'] = np.arange(tmp.shape[0])

R has a very simple way of achieving this behaviour using the tidyverse packages, but I haven't quite found the well-oiled way to achieve the same thing with pandas. Any help provided is greatly appreciated!


Solution

  • .groupby().cumcount()

    df.assign(ord=df.groupby(['id', 'chi']).cumcount())
    
         id chi  prop  ord
    0   100   L    67    0
    1   100   L    68    1
    2   100   L    68    2
    3   100   L    68    3
    4   100   L    70    4
    5   100   L    71    5
    6   100   R    67    0
    7   100   R    68    1
    8   100   R    68    2
    9   100   R    68    3
    10  110   R    70    0
    11  110   R    71    1
    12  101   L    67    0
    13  101   L    68    1
    14  101   L    69    2
    15  101   L    71    3
    16  101   L    72    4
    17  201   R    67    0
    18  201   R    68    1
    19  201   R    69    2
    

    defaultdict and itertools.count

    from itertools import count
    from collections import defaultdict
    
    d = defaultdict(count)
    
    df.assign(ord=[next(d[t]) for t in zip(df.id, df.chi)])
    
         id chi  prop  ord
    0   100   L    67    0
    1   100   L    68    1
    2   100   L    68    2
    3   100   L    68    3
    4   100   L    70    4
    5   100   L    71    5
    6   100   R    67    0
    7   100   R    68    1
    8   100   R    68    2
    9   100   R    68    3
    10  110   R    70    0
    11  110   R    71    1
    12  101   L    67    0
    13  101   L    68    1
    14  101   L    69    2
    15  101   L    71    3
    16  101   L    72    4
    17  201   R    67    0
    18  201   R    68    1
    19  201   R    69    2