Search code examples
pythonpandasfilteringranking

Is there a way to rank some items in a pandas dataframe and exclude others?


I have a pandas dataframe called ranks with my clusters and their key metrics. I rank them them using rank() however there are two specific clusters which I want ranked differently to the others.

ranks = pd.DataFrame(data={'Cluster': ['0', '1', '2',
                                   '3', '4', '5','6', '7', '8', '9'],
                        'No. Customers': [145118, 
                                        2, 
                                        1236, 
                                        219847, 
                                        9837,
                                        64865,
                                        3855,
                                        219549,
                                        34171,
                                        3924120],  
                        'Ave. Recency': [39.0197, 
                                        47.0, 
                                        15.9716, 
                                        41.9736, 
                                        23.9330,
                                        24.8281,
                                        26.5647,
                                        17.7493,
                                        23.5205,
                                        24.7933],
                        'Ave. Frequency': [1.7264, 
                                        19.0, 
                                        24.9101, 
                                        3.0682, 
                                        3.2735,
                                        1.8599,
                                        3.9304,
                                        3.3356,
                                        9.1703,
                                        1.1684],
                        'Ave. Monetary': [14971.85, 
                                        237270.00, 
                                        126992.79, 
                                        17701.64, 
                                        172642.35,
                                        13159.21,
                                        54333.56,
                                        17570.67,
                                        42136.68,
                                        4754.76]})
ranks['Ave. Spend'] = ranks['Ave. Monetary']/ranks['Ave. Frequency']
   Cluster   No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
0    0           145118        39.0197       1.7264         14,971.85     8,672.07
1    1           2             47.0          19.0          237,270.00    12,487.89
2    2           1236          15.9716       24.9101       126,992.79     5,098.02
3    3           219847        41.9736       3.0682         17,701.64     5,769.23
4    4           9837          23.9330       3.2735        172,642.35    52,738.42
5    5           64865         24.8281       1.8599         13,159.21     7,075.19
6    6           3855          26.5647       3.9304         54,333.56    13,823.64
7    7           219549        17.7493       3.3356         17,570.67     5,267.52
8    8           34171         23.5205       9.1703         42,136.68     4,594.89
9    9           3924120       24.7933       1.1684          4,754.76     4,069.21 

I then apply the rank() method like this:

ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
ranks['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank'] = ranks['overall'].rank(method='first')

Which gives me this:

   Cluster  No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank
0    0          145118       39.0197      1.7264        14,971.85    8,672.07     8     9       8      4      29        9     
1    1          2            47.0         19.0         237,270.00   12,487.89     10    2       1      3      16        3 
2    2          1236         15.9716      24.9101      126,992.79    5,098.02     1     1       3      8      13        1
3    3          219847       41.9736      3.0682        17,701.64    5,769.23     9     7       6      6      28        7
4    4          9837         23.9330      3.2735       172,642.35   52,738.42     4     6       2      1      13        2
5    5          64865        24.8281      1.8599        13,159.21    7,075.19     6     8       9      5      28        8
6    6          3855         26.5647      3.9304        54,333.56   13,823.64     7     4       4      2      17        4
7    7          219549       17.7493      3.3356        17,570.67    5,267.52     2     5       7      7      21        6
8    8          34171        23.5205      9.1703        42,136.68    4,594.89     3     3       5      9      20        5
9    9          3924120      24.7933      1.1684         4,754.76    4,069.21     5     10      10     10     35        10

This does what it's suppose to do, however the cluster with the highest Ave. Spend needs to be ranked 1 at all times and the cluster with the highest Ave. Recency needs to be ranked last at all times.

So I modified the code above to look like this:

if(ranks['s_rank'].min() == 1):
    ranks['overall_rank_2'] = 1
elif(ranks['r_rank'].max() == len(ranks)):
    ranks['overall_rank_2'] = len(ranks)
else:
    ranks_2 = ranks.drop(ranks.index[[ranks[ranks['s_rank'] == ranks['s_rank'].min()].index[0],ranks[ranks['r_rank'] == ranks['r_rank'].max()].index[0]]])
    ranks_2['r_rank'] = ranks_2['Ave. Recency'].rank()
    ranks_2['f_rank'] = ranks_2['Ave. Frequency'].rank(ascending=False)
    ranks_2['m_rank'] = ranks_2['Ave. Monetary'].rank(ascending=False)
    ranks_2['s_rank'] = ranks_2['Ave. Spend'].rank(ascending=False)
    ranks_2['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
    ranks['overall_rank_2'] = ranks_2['overall'].rank(method='first')

Then I get this

   Cluster  No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank|overall_rank_2
0    0          145118       39.0197      1.7264        14,971.85    8,672.07     8     9       8      4      29        9             1     
1    1          2            47.0         19.0         237,270.00   12,487.89     10    2       1      3      16        3             1 
2    2          1236         15.9716      24.9101      126,992.79    5,098.02     1     1       3      8      13        1             1
3    3          219847       41.9736      3.0682        17,701.64    5,769.23     9     7       6      6      28        7             1
4    4          9837         23.9330      3.2735       172,642.35   52,738.42     4     6       2      1      13        2             1
5    5          64865        24.8281      1.8599        13,159.21    7,075.19     6     8       9      5      28        8             1
6    6          3855         26.5647      3.9304        54,333.56   13,823.64     7     4       4      2      17        4             1
7    7          219549       17.7493      3.3356        17,570.67    5,267.52     2     5       7      7      21        6             1
8    8          34171        23.5205      9.1703        42,136.68    4,594.89     3     3       5      9      20        5             1
9    9          3924120      24.7933      1.1684         4,754.76    4,069.21     5     10      10     10     35        10            1

Please help me modify the above if statement or perhaps recommend a different approach altogether. This ofcourse needs to be as dynamic as possible.


Solution

  • So you want a custom ranking on your dataframe, where the cluster(/row) with the highest Ave. Spend is always ranked 1, and the one with the highest Ave. Recency always ranks last.

    The solution is five lines. Notes:

    • You had the right idea with DataFrame.drop(), just use idxmax() to get the index of both of the rows that will need special treatment, and store it, so you don't need a huge unwieldy logical filter expression in your drop.
    • No need to make so many temporary columns, or the temporary copy ranks_2 = ranks.drop(...); just pass the result of the drop() into a rank() ...
    • ... via a .sum(axis=1) on your desired columns, no need to define a lambda, or save its output in the temp column 'overall'.
    • ...then we just feed those sum-of-ranks into rank(), which will give us values from 1..8, so we add 1 to offset the results of rank() to be 2..9. (You can generalize this).
    • And we manually set the 'overall_rank' for the Ave. Spend, Ave. Recency rows.
    • (Yes you could also implement all this as a custom function whose input is the four Ave. columns or else the four *_rank columns.)

    Code: (see at bottom for boilerplate to read in your dataframe, next time please make your example MCVE, to help us help you)

    # Compute raw ranks like you do
    ranks['r_rank'] = ranks['Ave. Recency'].rank()
    ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
    ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
    ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
    
    # Find the indices of both the highest AveSpend and AveRecency    
    ismax = ranks['Ave. Spend'].idxmax()
    irmax = ranks['Ave. Recency'].idxmax()
    
    # Get the overall ranking for every row other than these... add 1 to offset for excluding the max-AveSpend row:
    ranks['overall_rank'] = 1 + ranks.drop(index = [ismax,irmax]) [['r_rank','f_rank','m_rank','s_rank']].sum(axis=1).rank(method='first')
    
    # (Note: in .loc[], can't mix indices (ismax) with column-names)
    ranks.loc[ ranks['Ave. Spend'].idxmax(), 'overall_rank' ] = 1 
    ranks.loc[ ranks['Ave. Recency'].idxmax(), 'overall_rank' ] = len(ranks)
    

    And here's the boilerplate to ingest your data:

    import pandas as pd
    
    from io import StringIO
    
    # """Cluster   No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
    dat = """
    0           145118        39.0197       1.7264         14,971.85     8,672.07
    1           2             47.0          19.0          237,270.00    12,487.89
    2           1236          15.9716       24.9101       126,992.79     5,098.02
    3           219847        41.9736       3.0682         17,701.64     5,769.23
    4           9837          23.9330       3.2735        172,642.35    52,738.42
    5           64865         24.8281       1.8599         13,159.21     7,075.19
    6           3855          26.5647       3.9304         54,333.56    13,823.64
    7           219549        17.7493       3.3356         17,570.67     5,267.52
    8           34171         23.5205       9.1703         42,136.68     4,594.89
    9           3924120       24.7933       1.1684          4,754.76     4,069.21 """
    
    # Remove the comma thousands-separator, to prevent your floats being read in as string
    dat = dat.replace(',', '')
    
    ranks = pd.read_csv(StringIO(dat), sep='\s+', names=
        "Cluster|No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend".split('|'))