I've got one big dataframe called df_incl_countries_with_outliers
with over 6000 cases and 100 variables. For simplicity, consider two variables: Country
and death_rate
. Every case is a person from a certain country that did or did not die (death_rate
is dichotomous). I can get the counts of how many people died per country by using this code:
df_incl_countries_with_outliers %>% count(Country, death_rate)
The thing is though, that I want to calculate a new variable mortality
which is calculated by all deaths of a certain country, divided by the sum of all cases of said country(minus the missing values, of course). Something like:
count(deaths_of_country_A/(deaths_of_country_A + alive_of_country_A))
and this of course for every country.
Why do I want to do this? Because I want to make a boxplot of the new variable mortality
so that we know if certain countries are outliers.
I assume it has something to do with the dplyr::mutate
function to create the variable and perhaps something with the dplyr::group_by
to group by country, together with count()
but I'm afraid that's as far as my knowledge goes. Every attempt at solving this problem of mine is greatly appreciated!
Found it! This is the code:
df_incl_countries_with_outliers <- df_incl_countries_with_outliers %>%
group_by(Country) %>%
mutate(
Mortality = sum(death_rate == "Dead", na.rm = TRUE) / sum(!is.na(death_rate))
)
First we group by the different countries. So when I say "all" in the following sentences, that hits on "all those in the same country". Mutate
makes a new variable in the dataframe. The new variable is called Mortality which is equal to the sum of all the dead people (where missing values are not counted in the result) divided by the sum of all the cases (again, missing values omitted).