
Ranking within groups


I have data in Stata that looks like this:

State Year Revenue Rank
A 2019 30 1
A 2019 30 1
A 2019 40 2
A 2020 45 1
A 2020 50 2
B 2019 35 1
B 2019 45 2
B 2020 22 1
B 2020 40 2

The Rank column above is what I would like to produce. Note that some rows, such as the first and second, are duplicates in State, Year and Revenue, and I want such rows to receive the same rank. In short, I want a ranking of Revenue within each State and Year. I tried group() but it did not give the desired result.


Solution

  • You're at liberty to call this ranking, but it doesn't correspond to

    1. what Stata supports with its egen, rank() function

    2. what it supports with its egen, group() function

    3. ranking in any strict statistical sense, whereby to a first approximation n observations are ranked 1 to n, or vice versa.
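
    To make the contrast concrete, here is a minimal sketch of what those two egen functions return on the example data. The variable names r_default, r_track and g are hypothetical and chosen only for this illustration.

    ```stata
    * Illustration only: r_default, r_track and g are hypothetical names
    clear
    input str1 state int year byte revenue
    "A" 2019 30
    "A" 2019 30
    "A" 2019 40
    "A" 2020 45
    "A" 2020 50
    "B" 2019 35
    "B" 2019 45
    "B" 2020 22
    "B" 2020 40
    end

    * rank(): by default tied values share the mean of their ranks
    egen r_default = rank(revenue), by(state year)

    * track: tied values share the lowest rank, and the next rank is skipped
    egen r_track = rank(revenue), by(state year) track

    * group(): one identifier per distinct combination, numbered over the whole dataset
    egen g = group(state year revenue)

    list, sepby(state year)
    ```

    For state A in 2019, the default ranks are 1.5, 1.5, 3 and the track ranks are 1, 1, 3, whereas the ranks asked for are 1, 1, 2. The group() identifiers do respect ties, but they run from 1 to 8 across the whole dataset rather than restarting in each state and year.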

    No matter, as what you want requires only one command line.

    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str1 state int year byte(revenue rank)
    "A" 2019 30 1
    "A" 2019 30 1
    "A" 2019 40 2
    "A" 2020 45 1
    "A" 2020 50 2
    "B" 2019 35 1
    "B" 2019 45 2
    "B" 2020 22 1
    "B" 2020 40 2
    end
    
    * Within each state-year group, sorted on revenue, add 1 whenever revenue changes
    bysort state year (revenue) : gen wanted = sum(revenue != revenue[_n-1])
    
    list, sepby(state year)
    
         +----------------------------------------+
         | state   year   revenue   rank   wanted |
         |----------------------------------------|
      1. |     A   2019        30      1        1 |
      2. |     A   2019        30      1        1 |
      3. |     A   2019        40      2        2 |
         |----------------------------------------|
      4. |     A   2020        45      1        1 |
      5. |     A   2020        50      2        2 |
         |----------------------------------------|
      6. |     B   2019        35      1        1 |
      7. |     B   2019        45      2        2 |
         |----------------------------------------|
      8. |     B   2020        22      1        1 |
      9. |     B   2020        40      2        2 |
         +----------------------------------------+
    

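    That is, within each group sorted on revenue, the running sum is bumped up by 1 every time a different value appears. This works for the first observation in any group because, under by:, revenue[_n-1] is then a tacit reference to observation 0, which does not exist and so evaluates to missing. Missing differs from any nonmissing value in the first observation, so the count starts at 1.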
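
    The one-line command can also be split into two steps, which may make the logic easier to inspect. This is only a sketch of the same idea: the names isnew and wanted2 are hypothetical, and the data are re-entered so that the block runs on its own.

    ```stata
    * Two-step sketch of the same logic; isnew and wanted2 are hypothetical names
    clear
    input str1 state int year byte(revenue rank)
    "A" 2019 30 1
    "A" 2019 30 1
    "A" 2019 40 2
    "A" 2020 45 1
    "A" 2020 50 2
    "B" 2019 35 1
    "B" 2019 45 2
    "B" 2020 22 1
    "B" 2020 40 2
    end

    * Step 1: flag observations whose revenue differs from the previous one in the group
    * (for the first observation in a group, revenue[_n-1] is missing, so the flag is 1)
    bysort state year (revenue) : gen byte isnew = revenue != revenue[_n-1]

    * Step 2: the running sum of the flags within each group is the wanted rank
    by state year : gen wanted2 = sum(isnew)

    list, sepby(state year)
    ```

    One caveat if revenue can be missing: missing values sort last, so they share a single highest rank within a group, and a group in which every revenue is missing gets 0 throughout, because the first comparison is then missing against missing. Adding if !missing(revenue) to the gen command is one way to leave such observations unranked.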