Rank within groups using python-pandas


I'm comparing a set of eight algorithms (solver column) on a set of instances. Each instance is executed once for each algorithm and for each level of a parameter D (which goes from 1 to 10). The resulting data frame looks like this:

         instance  D    z             solver
0   1000_ep0.0075  1  994         threatened
1   1000_ep0.0075  1  993               desc
2   1000_ep0.0075  1  994             degree
3   1000_ep0.0075  1  993    threatened_desc
4   1000_ep0.0075  1  993  threatened_degree
5   1000_ep0.0075  1  994         desc_later
6   1000_ep0.0075  1  994       degree_later
7   1000_ep0.0075  1  993         dyn_degree
8   1000_ep0.0075  2  986         threatened
9   1000_ep0.0075  2  987               desc
10  1000_ep0.0075  2  988             degree
11  1000_ep0.0075  2  987    threatened_desc
12  1000_ep0.0075  2  986  threatened_degree
13  1000_ep0.0075  2  987         desc_later
14  1000_ep0.0075  2  988       degree_later
15  1000_ep0.0075  2  987         dyn_degree
....

The z column holds the value found by the algorithm (the smaller the better).

I would like to add a column to the dataframe with the rank of each algorithm according to its z value within each combination <instance, D>. For the example above, it would be something like this:

         instance  D    z             solver  z_rank
0   1000_ep0.0075  1  994         threatened       2
1   1000_ep0.0075  1  993               desc       1
2   1000_ep0.0075  1  994             degree       2
3   1000_ep0.0075  1  993    threatened_desc       1
4   1000_ep0.0075  1  993  threatened_degree       1
5   1000_ep0.0075  1  994         desc_later       2
6   1000_ep0.0075  1  994       degree_later       2
7   1000_ep0.0075  1  993         dyn_degree       1
8   1000_ep0.0075  2  986         threatened       1
9   1000_ep0.0075  2  987               desc       2
10  1000_ep0.0075  2  988             degree       3
11  1000_ep0.0075  2  987    threatened_desc       2
12  1000_ep0.0075  2  986  threatened_degree       1
13  1000_ep0.0075  2  987         desc_later       2
14  1000_ep0.0075  2  988       degree_later       3
15  1000_ep0.0075  2  987         dyn_degree       2
...

Using python-pandas, this is what I could get so far:

df.loc[:, 'z_rank'] = df_rg.groupby(['instance', 'D'])['z'].rank()
df.head(16)
         instance  D    z             solver  z_rank
0   1000_ep0.0075  1  994         threatened    47.5
1   1000_ep0.0075  1  993               desc    16.5
2   1000_ep0.0075  1  994             degree    47.5
3   1000_ep0.0075  1  993    threatened_desc    16.5
4   1000_ep0.0075  1  993  threatened_degree    16.5
5   1000_ep0.0075  1  994         desc_later    47.5
6   1000_ep0.0075  1  994       degree_later    47.5
7   1000_ep0.0075  1  993         dyn_degree    16.5
8   1000_ep0.0075  2  986         threatened     7.0
9   1000_ep0.0075  2  987               desc    18.5
10  1000_ep0.0075  2  988             degree    44.5
11  1000_ep0.0075  2  987    threatened_desc    18.5
12  1000_ep0.0075  2  986  threatened_degree     7.0
13  1000_ep0.0075  2  987         desc_later    18.5
14  1000_ep0.0075  2  988       degree_later    44.5
15  1000_ep0.0075  2  987         dyn_degree    18.5

Which is clearly not what I want.

Could somebody help me with that?


Solution

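  • Use method='dense' in SeriesGroupBy.rank(). With dense ranking, tied values share the same rank and the rank increases by exactly 1 from one distinct value to the next within each group (the default, method='average', gives tied values the average of the positions they occupy instead):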

    df['z_rank'] = df.groupby(['instance', 'D'])['z'].rank(method='dense').astype(int)
    

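To check this against the data in the question, here is a minimal self-contained sketch. It rebuilds only the 16 rows shown above (the full data set is not available here), and the extra column name z_rank_avg is made up purely to contrast the dense ranking with the default method='average':

```python
import pandas as pd

# The 16 sample rows shown in the question; the real data has more rows.
solvers = ["threatened", "desc", "degree", "threatened_desc",
           "threatened_degree", "desc_later", "degree_later", "dyn_degree"]
df = pd.DataFrame({
    "instance": ["1000_ep0.0075"] * 16,
    "D": [1] * 8 + [2] * 8,
    "z": [994, 993, 994, 993, 993, 994, 994, 993,
          986, 987, 988, 987, 986, 987, 988, 987],
    "solver": solvers * 2,
})

# Rank z separately inside every (instance, D) group.
grouped_z = df.groupby(["instance", "D"])["z"]

# Dense ranking: ties share a rank, and the next distinct value gets rank + 1.
dense_rank = grouped_z.rank(method="dense").astype(int)

# Default ranking (method="average"): ties get the mean of their positions.
average_rank = grouped_z.rank()

df["z_rank"] = dense_rank
df["z_rank_avg"] = average_rank  # hypothetical column, for comparison only

print(df)
```

On these rows, z_rank reproduces the column requested in the question (2, 1, 2, 1, 1, 2, 2, 1 for D = 1), while z_rank_avg gives fractional values such as 6.5 and 2.5 because of the ties.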