Search code examples
pythonpandasranking-functions

How to create a ranking variable/function for different periods in a panel data?


I have a dataset, df, that looks like this:

Date Code City State Population Quantity QTDPERCAPITA
2020-01 11001 Los Angeles CA 5000000 100000 0.02
2020-02 11001 Los Angeles CA 5000000 125000 0.025
2020-03 11001 Los Angeles CA 5000000 135000 0.027
2020-01 12002 Houston TX 3000000 150000 0.05
2020-02 12002 Houston TX 3000000 100000 0.033
2020-03 12002 Houston TX 3000000 200000 0.066
... ... ... ... ... ... ...
2021-07 11001 Los Angeles CA 5500499 340000 0.062
2021-07 12002 Houston TX 3250012 211000 0.065

WhereQTDPERCAPITA is simply Quantity/Population. I have multiple cities (4149 to be more precise).

The quantities change according to every month, and so does the population.

I would like to create a new variable that serve as a ranking, ranging from [0,1], where 0 is the city with the lowest QTDPERCAPITA in that month, and 1 is the city with the most quantity per capita in that month. Essentially, I want to create a new column that looks like this:

Date Code City State Population Quantity QTDPERCAPITA RANKING
2020-01 11001 Los Angeles CA 5000000 100000 0.02 0
2020-02 11001 Los Angeles CA 5000000 125000 0.025 0
2020-03 11001 Los Angeles CA 5000000 135000 0.027 0
2020-01 12002 Houston TX 3000000 150000 0.05 1
2020-02 12002 Houston TX 3000000 100000 0.033 1
2020-03 12002 Houston TX 3000000 200000 0.066 1
... ... ... ... ... ... ... ...
2021-07 11001 Los Angeles CA 5500499 340000 0.062 0
2021-07 12002 Houston TX 3250012 211000 0.065 1

How can I create this column such that the RANKING changes every month? I was thinking of a for loop that extracts the QTDPERCAPITA for every city on every unique date, and creates a new column, df['RANKING'] with the same date and city.


Solution

  • You can use:

    # MinMax scaler: (rank - min) / (max - min)
    ranking = lambda x: (x.rank() - 1) / (len(x) - 1)
    
    # Rank between [0, 1] -> 0 the lowest, 1 the highest
    df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].apply(ranking)
    
    # Rank between [1, 4149] -> 1 the lowest, 4149 the highest
    # df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].rank('dense')
    

    Output:

    Date Code City State Population Quantity QTDPERCAPITA RANKING
    2020-01 11001 Los Angeles CA 5000000 100000 0.02 0
    2020-02 11001 Los Angeles CA 5000000 125000 0.025 0
    2020-03 11001 Los Angeles CA 5000000 135000 0.027 0
    2020-01 12002 Houston TX 3000000 150000 0.05 1
    2020-02 12002 Houston TX 3000000 100000 0.033 1
    2020-03 12002 Houston TX 3000000 200000 0.066 1
    2021-07 11001 Los Angeles CA 5500499 340000 0.618 1
    2021-07 12002 Houston TX 3250012 211000 0.065 0