Search code examples
rdplyrcount

How to compute the variable mortality based on "grouped counts"?


I've got one big dataframe called df_incl_countries_with_outliers with over 6000 cases and 100 variables. For simplicity, consider two variables: Country and death_rate. Every case is a person from a certain country that did or did not die (death_rate is dichotomous). I can get the counts of how many people died per country by using this code:

df_incl_countries_with_outliers %>% count(Country, death_rate)

The thing is though, that I want to calculate a new variable mortality which is calculated by all deaths of a certain country, divided by the sum of all cases of said country(minus the missing values, of course). Something like:

count(deaths_of_country_A/(deaths_of_country_A + alive_of_country_A))

and this of course for every country.

Why do I want to do this? Because I want to make a boxplot of the new variable mortality so that we know if certain countries are outliers.

I assume it has something to do with the dplyr::mutate function to create the variable and perhaps something with the dplyr::group_by to group by country, together with count() but I'm afraid that's as far as my knowledge goes. Every attempt at solving this problem of mine is greatly appreciated!


Solution

  • Found it! This is the code:

    df_incl_countries_with_outliers <- df_incl_countries_with_outliers %>%
         group_by(Country) %>%
         mutate(
             Mortality = sum(death_rate == "Dead", na.rm = TRUE) / sum(!is.na(death_rate))
         )
    

    First we group by the different countries. So when I say "all" in the following sentences, that hits on "all those in the same country". Mutate makes a new variable in the dataframe. The new variable is called Mortality which is equal to the sum of all the dead people (where missing values are not counted in the result) divided by the sum of all the cases (again, missing values omitted).