
Proportions by Year and State using egen Command


In Stata, I am trying to generate a new variable equal to the share of award winners in each state for each year.

I am using the egen command and would like to know whether it is the appropriate command for this. My dataset is extremely large, so it is hard to check the result manually. I have created a dummy variable for each year, and award_winner is a binary variable equal to 1 if the business won the award that year and 0 if it did not.

sort state year_dummy*
by state year_dummy*: egen winner_bystate_year = mean(award_winner)

Solution

  • This is easy enough to test with a small fake dataset in which the correct answers are clear. I don't know why you introduced dummy variables when you could work directly with year, but the answer is the same either way: a complete set of year dummies defines exactly the same groups as year itself.

    clear 
    set obs 12 
    gen state = cond(_n < 7, "A", "B")
    egen year = seq(), from(2019) to(2020) block(3)
    gen award_winner = real(word("0 0 0 0 0 1 0 1 1 1 1 1", _n)) 
    gen order = _n 
    tab year, gen(year)
    
    bysort state year?: egen suggested = mean(award_winner)
    
    bysort state year: egen better = mean(award_winner)
    
    sort order 
    list, sepby(state year)
    
         +-----------------------------------------------------------------------+
         | state   year   award_~r   order   year1   year2   sugges~d     better |
         |-----------------------------------------------------------------------|
      1. |     A   2019          0       1       1       0          0          0 |
      2. |     A   2019          0       2       1       0          0          0 |
      3. |     A   2019          0       3       1       0          0          0 |
         |-----------------------------------------------------------------------|
      4. |     A   2020          0       4       0       1   .3333333   .3333333 |
      5. |     A   2020          0       5       0       1   .3333333   .3333333 |
      6. |     A   2020          1       6       0       1   .3333333   .3333333 |
         |-----------------------------------------------------------------------|
      7. |     B   2019          0       7       1       0   .6666667   .6666667 |
      8. |     B   2019          1       8       1       0   .6666667   .6666667 |
      9. |     B   2019          1       9       1       0   .6666667   .6666667 |
         |-----------------------------------------------------------------------|
     10. |     B   2020          1      10       0       1          1          1 |
     11. |     B   2020          1      11       0       1          1          1 |
     12. |     B   2020          1      12       0       1          1          1 |
         +-----------------------------------------------------------------------+
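
    As a further cross-check on the listing above, the cell means can be displayed directly. This is only a sketch: it assumes the toy dataset created above is still in memory, and it uses the summarize() option of tabulate to show the mean of award_winner in each state-year cell.

    ```stata
    * Assumes the 12-observation toy dataset from above is in memory.
    * Show the mean of award_winner for each state-year cell,
    * suppressing standard deviations and frequencies.
    tabulate state year, summarize(award_winner) nostandard nofreq
    ```

    The displayed means should agree with the values of suggested and better in the listing.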
    
    
    

    The general principle is simple and important: to test code for statistical software, use a simple dataset for which there are known or obvious answers. Here "known" can include answers given by an existing implementation, in the same or other software, that is presumed correct.
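
    One way to apply that principle without inspecting a listing by eye is to let Stata do the comparison with assert. The following is a minimal sketch, not a definitive recipe: it rebuilds the same toy data, and the variable names share, n_obs, n_win and share_check are illustrative choices, not names from the question.

    ```stata
    * Rebuild the toy data in which the correct shares are obvious
    clear
    set obs 12
    gen state = cond(_n < 7, "A", "B")
    egen year = seq(), from(2019) to(2020) block(3)
    gen award_winner = real(word("0 0 0 0 0 1 0 1 1 1 1 1", _n))

    * Share of winners in each state-year group, as in the answer
    bysort state year: egen share = mean(award_winner)

    * Independent calculation: winners divided by group size
    bysort state year: gen n_obs = _N
    bysort state year: egen n_win = total(award_winner)
    gen share_check = n_win / n_obs

    * assert stops with an error if any observation disagrees;
    * a small tolerance allows for floating-point rounding
    assert reldif(share, share_check) < 1e-6

    * Spot-check two groups whose answers are known by construction
    assert share == 0 if state == "A" & year == 2019
    assert abs(share - 1/3) < 1e-6 if state == "A" & year == 2020
    ```

    If every assert passes silently, the egen result matches both the independent calculation and the known answers. The same comparison of share against share_check can then be run on the full dataset, where checking by eye is impractical.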