Search code examples
rgroupingcountry-codes

How to group values in a column (R)


I'm creating a summary table that groups my records by country of destination:

SummarybyLocation <- PSTNRecords %>% 
                           group_by(Destination) %>%
                                summarize(
                                  Calls = n(), 
                                  Minutes = sum(durationMinutes), 
                                  MaxDuration = max(durationMinutes),
                                  AverageDuration = mean(durationMinutes), 
                                  Charges = sum(charge),
                                  Fees = sum(connectionCharge)
                                )

SummarybyLocation

The resulting table is as follows:

Summarized table

I realized that the Destination values are inconsistent (for example, "France" and "FR" both refer to the same area, and then I have a "North America" that I presume gathers USA and Canada.

I was wondering if there's a way of creating custom groups for these values, so that the aggregation would make more sense. I tried to use the countrycode package to add an iso2c column, but that doesn't resolve the problem of managing other area aggregations like "North America".

I would really appreciate some suggestions on how to handle this.

Thanks in advance!


Solution

  • Here is one possibility for cleaning up the data with a very minimal example. First, I get a list of country names and the 2 and 3 letter abbreviations, and put into a dataframe, countries. Then, I left_join countries to df for the two letter code, which in this case matches FR. Then, I repeat the left_join but with the 3 letter code, which has no matches in this case. Then, I coalesce the two new columns together, i.e., Country.x and Country.y. Then, I use case_when to multiple if-else statements. First, if Country is not an NA, then I replace Destination with the full country name. This is where you can add in other arguments if you have other items (e.g., Europe) that you might also need to fix. Next, I replace North America with "United States-Canada-Mexico". Finally, I remove the columns that start with "Country".

    library(XML)
    library(RCurl)
    library(rlist)
    library(tidyverse)
    
    theurl <-
      getURL("https://www.iban.com/country-codes",
             .opts = list(ssl.verifypeer = FALSE))
    countries <- readHTMLTable(theurl)
    countries <-
      list.clean(countries, fun = is.null, recursive = FALSE)[[1]]
    
    
    df %>%
      left_join(.,
                countries %>% select(Country, `Alpha-2 code`),
                by = c("Destination" = "Alpha-2 code")) %>%
      left_join(.,
                countries %>% select(Country, `Alpha-3 code`),
                by = c("Destination" = "Alpha-3 code")) %>%
      mutate(
        Country = coalesce(Country.x, Country.y),
        Destination = case_when(!is.na(Country) ~ Country,
                                Destination == "North America" ~ "United States-Canada-Mexico",
                                TRUE ~ Destination
      )) %>%
    select(-c(starts_with("Country")))
    

    Output

                      Destination durationMinutes charge connectionCharge
    1                      France            6.57   0.00                0
    2                      France            3.34   1.94                0
    3               United States          234.40   3.00                0
    4 United States-Canada-Mexico           23.40   2.00                0
    

    However, if you have a lot of different variations, then you probably just want to create a simple dataframe with the substitutions, as then you can just do one left_join.

    Another option is to also add in a Continent column, which you could get from countrycode.

    library(countrycode)
    
    countrycode(sourcevar = df$Destination,
                origin = "country.name",
                destination = "continent")
    
    [1] NA         "Europe"   "Americas" NA   
    

    Data

    df <- structure(list(Destination = c("FR", "France", "United States", 
    "North America"), durationMinutes = c(6.57, 3.34, 234.4, 23.4
    ), charge = c(0, 1.94, 3, 2), connectionCharge = c(0, 0, 0, 0
    )), class = "data.frame", row.names = c(NA, -4L))