I have a dataframe that includes various ages and a value for each age. I would like to classify the different ages into broader age groups. To do so, I had to create a fairly complicated nested ifelse statement:
library(tidyverse)
df <- data.frame(age = c("15 to 17", "18 and 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59"),
value = sample(1000:2000,10, replace=TRUE))
new_df = df %>%
mutate(age_band =
ifelse(age %in% c("15 to 17","18 and 19"), '15 to 19', ifelse(age %in% c("20 to 24","25 to 29"), '20s', ifelse(age %in% c("30 to 34","35 to 39"), '30s','40+'))))
Is there any way to do this without using such complex nested statements? The data goes all the way to ages 85 and over, and it got very complicated to classify each different age group.
Following @Claudiu Papasteri recommendations:
library(dplyr)
df %>%
mutate(age_band = case_when( age %in% c("15 to 17","18 and 19") ~ '15 to 19',
age %in% c("20 to 24","25 to 29") ~ '20s',
age %in% c("30 to 34","35 to 39") ~ '30s',
TRUE ~'40+')
)
ps: I would include all specific statements in case_when
, even the last one for "40+", this way you can keep track of any problem in your dataset. Because if something is missing, or you have a typo, everything else will be coded as "40+". So by stating every case you can catch the problem and fix everything before running an analysis or generating graphs. And the last statement could be changed to TRUE ~ age
. This means, whatever remains, use the same value, or TRUE ~ NA
, that you recode the remaining stuff for missing. So you know that all missing means you have a problem you have to fix.
age value age_band
1 15 to 17 1432 15 to 19
2 18 and 19 1112 15 to 19
3 20 to 24 1265 20s
4 25 to 29 1076 20s
5 30 to 34 1212 30s
6 35 to 39 1238 30s
7 40 to 44 1384 40+
8 45 to 49 1612 40+
9 50 to 54 1606 40+
10 55 to 59 1897 40+