
How to group a column with character values in a new column in R


I have a data set with a countries column. I want to create a new column that classifies each country into one of the following categories: first world, second world, or third world. I'm relatively new to R and I'm finding it difficult to find a proper function that deals with character values!

My dataset contains countries like this, and I have three vectors listing the countries in each category, as shown below:

nt_final_table$`Country name`
#[1] "Finland"                   "Denmark"                   "Switzerland"              
#[4] "Iceland"                   "Netherlands"               "Norway"                   
#[7] "Sweden"                    "Luxembourg"                "New Zealand"              
#[10] "Austria"                   "Australia"                 "Israel"       

first_world_countries <- c("Australia","Austria","Belgium","Canada","Denmark","France","Germany","Greece","Iceland","Ireland","Israel","Italy","Japan","Luxembourg","Netherlands","New Zealand","Norway","Portugal","South Korea",
"Spain","Sweden","Switzerland","Turkey","United Kingdom","USA")

Second_world_countries <- c("Albania","Armenia","Azerbaijan","Belarus","Bosnia and Herzegovina","Bulgaria","China","Croatia","Cuba","Czech Republic","EastGermany","Estonia","Georgia","Hungary","Kazakhstan","Kyrgyzstan","Laos","Poland","Romania","Russia","Serbia","Slovakia","Slovenia","Tajikistan","Turkmenistan","Ukraine","Uzbekistan","Vietnam")

Third_world_countries <- c("Somalia","Niger","South Sudan")

I want a new column that contains one of the following values, based on the Country name column: First World, Second World, or Third World.

Any help would be appreciated! Thanks!


Solution

  • Here are 2 ways you could do this.

    Using dplyr package

    You could use case_when from the dplyr package to do this.

    
    library(dplyr)
    
    country_name <-c("Finland", "Denmark", "Switzerland","Iceland", "Netherlands", "Norway", "Sweden", "Luxembourg", "New Zealand",
                     "Austria", "Australia", "Israel")
    
    nt_final_table <- data.frame(country_name)
    
    first_world_countries <- c("Australia","Austria","Belgium","Canada","Denmark","France","Germany","Greece","Iceland","Ireland","Israel","Italy","Japan","Luxembourg","Netherlands","New Zealand","Norway","Portugal","South Korea", "Spain","Sweden","Switzerland","Turkey","United Kingdom","USA")
    
    second_world_countries <- c("Albania","Armenia","Azerbaijan","Belarus","Bosnia and Herzegovina","Bulgaria","China","Croatia","Cuba","Czech Republic","EastGermany","Estonia","Georgia","Hungary","Kazakhstan","Kyrgyzstan","Laos","Poland","Romania","Russia","Serbia","Slovakia","Slovenia","Tajikistan","Turkmenistan","Ukraine","Uzbekistan","Vietnam")
    
    third_world_countries <- c("Somalia","Niger","South Sudan")
    
    # case_when() checks the conditions in order; the first one that matches wins
    nt_final_table_categorized <- nt_final_table %>%
      mutate(category = case_when(
        country_name %in% first_world_countries  ~ "First",
        country_name %in% second_world_countries ~ "Second",
        country_name %in% third_world_countries  ~ "Third",
        TRUE                                     ~ "Not listed"  # fallback: in none of the lists
      ))
    
    nt_final_table_categorized
    

    Sample output

       country_name   category
    1       Finland Not listed
    2       Denmark      First
    3   Switzerland      First
    4       Iceland      First
    5   Netherlands      First
    6        Norway      First
    7        Sweden      First
    8    Luxembourg      First
    9   New Zealand      First
    10      Austria      First
    11    Australia      First
    12       Israel      First
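
    A "Not listed" row can mean the country really is in none of the three vectors, or that its spelling in the data differs from the spelling in the vectors (the lists use "USA", for example). As a minimal sketch, assuming the same data and vectors as above, setdiff() from base R shows which names in the table appear in none of the lists. The names all_listed and unmatched are illustrative and not part of the original answer.

    ```r
    country_name <- c("Finland", "Denmark", "Switzerland", "Iceland", "Netherlands", "Norway",
                      "Sweden", "Luxembourg", "New Zealand", "Austria", "Australia", "Israel")

    nt_final_table <- data.frame(country_name, stringsAsFactors = FALSE)

    first_world_countries <- c("Australia", "Austria", "Belgium", "Canada", "Denmark", "France",
                               "Germany", "Greece", "Iceland", "Ireland", "Israel", "Italy",
                               "Japan", "Luxembourg", "Netherlands", "New Zealand", "Norway",
                               "Portugal", "South Korea", "Spain", "Sweden", "Switzerland",
                               "Turkey", "United Kingdom", "USA")

    second_world_countries <- c("Albania", "Armenia", "Azerbaijan", "Belarus",
                                "Bosnia and Herzegovina", "Bulgaria", "China", "Croatia", "Cuba",
                                "Czech Republic", "EastGermany", "Estonia", "Georgia", "Hungary",
                                "Kazakhstan", "Kyrgyzstan", "Laos", "Poland", "Romania", "Russia",
                                "Serbia", "Slovakia", "Slovenia", "Tajikistan", "Turkmenistan",
                                "Ukraine", "Uzbekistan", "Vietnam")

    third_world_countries <- c("Somalia", "Niger", "South Sudan")

    # Every country that has a category (illustrative helper name)
    all_listed <- c(first_world_countries, second_world_countries, third_world_countries)

    # Names present in the table but absent from all three lists
    unmatched <- setdiff(nt_final_table$country_name, all_listed)
    unmatched
    # [1] "Finland"
    ```

    Any name that shows up here either needs adding to one of the vectors or needs its spelling aligned with the lists before categorising.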
    

    Using base R

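    In base R, we could create a data frame that lists each country with its category, then use merge with all.x = TRUE to perform a left join of the two data frames. Countries that appear in none of the lists get NA, and merge sorts the result by the join column by default, which is why the rows in the output are in alphabetical order.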

    country_name <-c("Finland", "Denmark", "Switzerland","Iceland", "Netherlands", "Norway", "Sweden", "Luxembourg", "New Zealand",
                     "Austria", "Australia", "Israel")
    
    nt_final_table <- data.frame(country_name)
    
    first_world_countries <- c("Australia","Austria","Belgium","Canada","Denmark","France","Germany","Greece","Iceland","Ireland","Israel","Italy","Japan","Luxembourg","Netherlands","New Zealand","Norway","Portugal","South Korea", "Spain","Sweden","Switzerland","Turkey","United Kingdom","USA")
    
    second_world_countries <- c("Albania","Armenia","Azerbaijan","Belarus","Bosnia and Herzegovina","Bulgaria","China","Croatia","Cuba","Czech Republic","EastGermany","Estonia","Georgia","Hungary","Kazakhstan","Kyrgyzstan","Laos","Poland","Romania","Russia","Serbia","Slovakia","Slovenia","Tajikistan","Turkmenistan","Ukraine","Uzbekistan","Vietnam")
    
    third_world_countries <- c("Somalia","Niger","South Sudan")
    
    # Reusing the name country_name gives the lookup table the same column name as nt_final_table
    country_name <- c(first_world_countries,second_world_countries,third_world_countries)
    
    categories <- c(rep("First", length(first_world_countries)),
                    rep("Second",length(second_world_countries)),
                    rep("Third",length(third_world_countries)))
    
    all_countries_categorised <- data.frame(country_name, categories)
    
    # all.x = TRUE keeps every row of nt_final_table, even without a match (left join)
    nt_final_table_categorized <- merge(nt_final_table, all_countries_categorised, by = "country_name", all.x = TRUE)
    
    nt_final_table_categorized
    

    Sample output

       country_name categories
    1     Australia      First
    2       Austria      First
    3       Denmark      First
    4       Finland       <NA>
    5       Iceland      First
    6        Israel      First
    7    Luxembourg      First
    8   Netherlands      First
    9   New Zealand      First
    10       Norway      First
    11       Sweden      First
    12  Switzerland      First
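
    The merged rows above come back in alphabetical order rather than in the original order, because merge sorts by the join column by default. If the original row order matters, one alternative (a sketch, not part of the original answer) is to look each country up with match(), which adds the new column without reordering nt_final_table. The names lookup and idx are illustrative.

    ```r
    country_name <- c("Finland", "Denmark", "Switzerland", "Iceland", "Netherlands", "Norway",
                      "Sweden", "Luxembourg", "New Zealand", "Austria", "Australia", "Israel")

    nt_final_table <- data.frame(country_name, stringsAsFactors = FALSE)

    first_world_countries <- c("Australia", "Austria", "Belgium", "Canada", "Denmark", "France",
                               "Germany", "Greece", "Iceland", "Ireland", "Israel", "Italy",
                               "Japan", "Luxembourg", "Netherlands", "New Zealand", "Norway",
                               "Portugal", "South Korea", "Spain", "Sweden", "Switzerland",
                               "Turkey", "United Kingdom", "USA")

    second_world_countries <- c("Albania", "Armenia", "Azerbaijan", "Belarus",
                                "Bosnia and Herzegovina", "Bulgaria", "China", "Croatia", "Cuba",
                                "Czech Republic", "EastGermany", "Estonia", "Georgia", "Hungary",
                                "Kazakhstan", "Kyrgyzstan", "Laos", "Poland", "Romania", "Russia",
                                "Serbia", "Slovakia", "Slovenia", "Tajikistan", "Turkmenistan",
                                "Ukraine", "Uzbekistan", "Vietnam")

    third_world_countries <- c("Somalia", "Niger", "South Sudan")

    # One row per listed country with its category (illustrative lookup table)
    lookup <- data.frame(
      country_name = c(first_world_countries, second_world_countries, third_world_countries),
      categories   = rep(c("First", "Second", "Third"),
                         times = c(length(first_world_countries),
                                   length(second_world_countries),
                                   length(third_world_countries))),
      stringsAsFactors = FALSE
    )

    # For each row of nt_final_table, find the position of its country in the lookup (NA if absent)
    idx <- match(nt_final_table$country_name, lookup$country_name)

    # Indexing with NA yields NA, so unlisted countries such as Finland get NA
    nt_final_table$categories <- lookup$categories[idx]

    nt_final_table
    ```

    The result has the same categories as the merge version, with Finland still NA, but the rows stay in their original order starting with Finland.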