Search code examples
pythonpandasmulti-indexpopulation

Pandas: how to calculate percentage of a population from elsewhere


I found this data file on covid vaccinations, and I'd like to see the vaccination coverage in (parts of) the population. It'll probably become more clear with the actual example, so bear with me.

If I read the csv using df = pd.read_csv('https://epistat.sciensano.be/Data/COVID19BE_VACC.csv', parse_dates=['DATE']) I get this result:

        DATE    REGION AGEGROUP SEX            BRAND DOSE  COUNT
0 2020-12-28  Brussels    18-34   F  Pfizer-BioNTech    A      1
1 2020-12-28  Brussels    45-54   F  Pfizer-BioNTech    A      2
2 2020-12-28  Brussels    55-64   F  Pfizer-BioNTech    A      3
3 2020-12-28  Brussels    55-64   M  Pfizer-BioNTech    A      1
4 2020-12-28  Brussels    65-74   F  Pfizer-BioNTech    A      2

I'm particularly interested in the numbers by region & date.
So I regrouped using df.groupby(['REGION','DATE']).sum()

                     COUNT
REGION   DATE             
Brussels 2020-12-28     56
         2020-12-30      5
         2021-01-05    725
         2021-01-06    989
         2021-01-07    994
...                    ...
Wallonia 2021-06-18  49567
         2021-06-19  43577
         2021-06-20   2730
         2021-06-21  37193
         2021-06-22  16938

In order to compare vaccination 'speeds' in different regions I have to transform the data from absolute to relative numbers, using the population from each region.

I have found some posts explaining how to calculate percentages in a multi-index dataframe like this, but the problem is that I want to divide each COUNT by a population number that is not in the original dataframe.

The population numbers are here below

REGION     POP
Flanders   6629143
Wallonia   3645243
Brussels   1218255

I think the solution must be in looping through the original df and checking both REGIONs or index levels, but I have absolutely no idea how. It's a technique I'd like to master, because it might come in handy when I want some other subsets with different populations (AGEGROUP or SEX maybe).

Thank you so much for reading this far!

Disclaimer: I've only just started out using Python, and this is my very first question on Stack Overflow, so please be gentle with me... The reason why I'm posting this is because I can't find an answer anywhere else. This is probably because I haven't got the terminology down and I don't exactly know what to look for ^_^


Solution

  • One option would be to reformat the population_df with set_index + rename:

    population_df = pd.DataFrame({
        'REGION': {0: 'Flanders', 1: 'Wallonia', 2: 'Brussels'},
        'POP': {0: 6629143, 1: 3645243, 2: 1218255}
    })
    
    denom = population_df.set_index('REGION').rename(columns={'POP': 'COUNT'})
    

    denom:

                COUNT
    REGION           
    Flanders  6629143
    Wallonia  3645243
    Brussels  1218255
    

    Then div the results of groupby sum relative to level=0:

    new_df = df.groupby(['REGION', 'DATE']).agg({'COUNT': 'sum'}).div(denom, level=0)
    

    new_df:

                            COUNT
    REGION   DATE                
    Brussels 2020-12-28  0.000046
             2020-12-30  0.000004
             2021-01-05  0.000595
             2021-01-06  0.000812
             2021-01-07  0.000816
    ...                       ...
    Wallonia 2021-06-18  0.013598
             2021-06-19  0.011954
             2021-06-20  0.000749
             2021-06-21  0.010203
             2021-06-22  0.004647
    

    Or as a new column:

    new_df = df.groupby(['REGION', 'DATE']).agg({'COUNT': 'sum'})
    new_df['NEW'] = new_df.div(denom, level=0)
    

    new_df:

                         COUNT       NEW
    REGION   DATE                       
    Brussels 2020-12-28     56  0.000046
             2020-12-30      5  0.000004
             2021-01-05    725  0.000595
             2021-01-06    989  0.000812
             2021-01-07    994  0.000816
    ...                    ...       ...
    Wallonia 2021-06-18  49567  0.013598
             2021-06-19  43577  0.011954
             2021-06-20   2730  0.000749
             2021-06-21  37193  0.010203
             2021-06-22  16938  0.004647