I found this data file on covid vaccinations, and I'd like to see the vaccination coverage in (parts of) the population. It'll probably become more clear with the actual example, so bear with me.
If I read the csv using df = pd.read_csv('https://epistat.sciensano.be/Data/COVID19BE_VACC.csv', parse_dates=['DATE'])
I get this result:
DATE REGION AGEGROUP SEX BRAND DOSE COUNT
0 2020-12-28 Brussels 18-34 F Pfizer-BioNTech A 1
1 2020-12-28 Brussels 45-54 F Pfizer-BioNTech A 2
2 2020-12-28 Brussels 55-64 F Pfizer-BioNTech A 3
3 2020-12-28 Brussels 55-64 M Pfizer-BioNTech A 1
4 2020-12-28 Brussels 65-74 F Pfizer-BioNTech A 2
I'm particularly interested in the numbers by region & date.
So I regrouped using df.groupby(['REGION','DATE']).sum()
COUNT
REGION DATE
Brussels 2020-12-28 56
2020-12-30 5
2021-01-05 725
2021-01-06 989
2021-01-07 994
... ...
Wallonia 2021-06-18 49567
2021-06-19 43577
2021-06-20 2730
2021-06-21 37193
2021-06-22 16938
In order to compare vaccination 'speeds' in different regions I have to transform the data from absolute to relative numbers, using the population from each region.
I have found some posts explaining how to calculate percentages in a multi-index dataframe like this, but the problem is that I want to divide each COUNT by a population number that is not in the original dataframe.
The population numbers are here below
REGION POP
Flanders 6629143
Wallonia 3645243
Brussels 1218255
I think the solution must be in looping through the original df and checking both REGIONs or index levels, but I have absolutely no idea how. It's a technique I'd like to master, because it might come in handy when I want some other subsets with different populations (AGEGROUP or SEX maybe).
Thank you so much for reading this far!
Disclaimer: I've only just started out using Python, and this is my very first question on Stack Overflow, so please be gentle with me... The reason why I'm posting this is because I can't find an answer anywhere else. This is probably because I haven't got the terminology down and I don't exactly know what to look for ^_^
One option would be to reformat the population_df
with set_index
+ rename
:
population_df = pd.DataFrame({
'REGION': {0: 'Flanders', 1: 'Wallonia', 2: 'Brussels'},
'POP': {0: 6629143, 1: 3645243, 2: 1218255}
})
denom = population_df.set_index('REGION').rename(columns={'POP': 'COUNT'})
denom
:
COUNT
REGION
Flanders 6629143
Wallonia 3645243
Brussels 1218255
Then div
the results of groupby sum
relative to level=0
:
new_df = df.groupby(['REGION', 'DATE']).agg({'COUNT': 'sum'}).div(denom, level=0)
new_df
:
COUNT
REGION DATE
Brussels 2020-12-28 0.000046
2020-12-30 0.000004
2021-01-05 0.000595
2021-01-06 0.000812
2021-01-07 0.000816
... ...
Wallonia 2021-06-18 0.013598
2021-06-19 0.011954
2021-06-20 0.000749
2021-06-21 0.010203
2021-06-22 0.004647
Or as a new column:
new_df = df.groupby(['REGION', 'DATE']).agg({'COUNT': 'sum'})
new_df['NEW'] = new_df.div(denom, level=0)
new_df
:
COUNT NEW
REGION DATE
Brussels 2020-12-28 56 0.000046
2020-12-30 5 0.000004
2021-01-05 725 0.000595
2021-01-06 989 0.000812
2021-01-07 994 0.000816
... ... ...
Wallonia 2021-06-18 49567 0.013598
2021-06-19 43577 0.011954
2021-06-20 2730 0.000749
2021-06-21 37193 0.010203
2021-06-22 16938 0.004647