Search code examples
statapercentage

Is there a way to calculate percentages comparing observations?


My data set looks like this in Stata

YEAR Gender Presidents
2023 Male 5
2023 Female 6
2023 Total 11
2022 Male 3
2022 Female 2
2022 Total 5

I want to be able to create a new variable/column that gives me the percentages for that year by taking the fraction of male/total and female/total for that particular year. In other words, I want my dataset to have an additional column that looks like this:

YEAR Gender Presidents Perc_pres
2023 Male 5 0.45
2023 Female 6 0.55
2023 Total 11 1
2022 Male 3 0.6
2022 Female 2 0.4
2022 Total 5 1

I tried the egen perc_pres_`gender' = total(cond(gender==Female), 1, .) command to create new 2 new columns for female total and male total and then took percentages but that seems too long a way to get what I want. I just want one column to give me the percentages for both men and women and no additional columns.


Solution

  • Generally, a data layout such as yours is an example of spreadsheet thinking that doesn't match standard good practice in statistical software. Totals based on some observations (rows) should not be held in other observations in the same dataset, but in other variables.

    Specifically, your expression Gender == Female at best makes sense if Female is another variable in the dataset. It would be illegal for either of the two other (and more obvious) possibilities, that Gender is a string variable, in which case "Female" is needed, or that Gender is a numeric variable with value labels, in which case a different comparison is needed.

    Your apparent use of a local macro gender also doesn't make sense to me.

    You are confusing percents (which add to 100) with proportions or fractions (which add to 1).

    The code needed here is awkward largely because your data layout is awkward.

    Here is one solution, which hinges on the fact that Female Male Total sort in that order.

    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int year str6 gender byte presidents
    2023 "Male"    5
    2023 "Female"  6
    2023 "Total"  11
    2022 "Male"    3
    2022 "Female"  2
    2022 "Total"   5
    end
    
    bysort year (gender) : gen percent = 100 * presidents / presidents[_N]
    
    list , sepby(year)
    
         +-------------------------------------+
         | year   gender   presid~s    percent |
         |-------------------------------------|
      1. | 2022   Female          2         40 |
      2. | 2022     Male          3         60 |
      3. | 2022    Total          5        100 |
         |-------------------------------------|
      4. | 2023   Female          6   54.54546 |
      5. | 2023     Male          5   45.45454 |
      6. | 2023    Total         11        100 |
         +-------------------------------------+
    

    If that doesn't help (enough), please use dataex to give an example (and see the stata tag wiki).