Generating a variable only including the top 4 firms with largest sales


My question is closely related to the question below:

I want to generate a variable that keeps only the four firms with the largest sales and excludes the rest.

In other words, the new variable should hold values only for the four firms with the largest sales in a given industry and year; for all other firms it should be missing (.).


Solution

  • Consider this:

    webuse grunfeld, clear
    * within each year, sort by invest and keep invest only in the last 4 observations
    bysort year (invest) : gen largest4 = cond(_n < _N - 3, ., invest)
    sort year invest
    list year largest4 if largest4 < . in 1/40, sepby(year)
    
         +-----------------+
         | year   largest4 |
         |-----------------|
      7. | 1935      39.68 |
      8. | 1935      40.29 |
      9. | 1935      209.9 |
     10. | 1935      317.6 |
         |-----------------|
     17. | 1936      50.73 |
     18. | 1936      72.76 |
     19. | 1936      355.3 |
     20. | 1936      391.8 |
         |-----------------|
     27. | 1937      74.24 |
     28. | 1937       77.2 |
     29. | 1937      410.6 |
     30. | 1937      469.9 |
         |-----------------|
     37. | 1938       51.6 |
     38. | 1938      53.51 |
     39. | 1938      257.7 |
     40. | 1938      262.3 |
         +-----------------+
    

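    If the variable had missing values, they would sort to the end of each block, because Stata treats numeric missing values as larger than any nonmissing number. They would then occupy some of the last four positions and mess up the results.

    So you need one more trick: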

    generate OK = !missing(invest) 
    bysort OK year (invest) : gen Largest4 = cond(_n < _N - 3, ., invest) if OK 
    sort year invest 
    list year Largest4 if Largest4 < . in 1/40, sepby(year) 
    

    In this example, which you can run yourself, there are no missing values, so the results are the same.
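To see the missing-value problem in action, here is a minimal sketch that blanks out one observation in the same dataset and compares the two approaches. The `replace` line is an assumption made purely for illustration, and the variable names `naive4` and `guarded4` are hypothetical, not part of the original answer.

```stata
webuse grunfeld, clear

* Assumption for illustration only: set one 1935 value to missing
replace invest = . if company == 1 & year == 1935

* Naive version: the missing value sorts last and takes one of the four slots
bysort year (invest) : gen naive4 = cond(_n < _N - 3, ., invest)

* Guarded version: rank only among observations with nonmissing invest
generate OK = !missing(invest)
bysort OK year (invest) : gen guarded4 = cond(_n < _N - 3, ., invest) if OK

* Compare how many nonmissing values each version keeps in 1935
count if naive4 < . & year == 1935
count if guarded4 < . & year == 1935
```

The naive version should keep only three nonmissing values for 1935, while the guarded version keeps four. For data like that in the question, the same pattern would presumably be `bysort OK industry year (sales)`, assuming variables named `industry`, `year` and `sales`.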