Tags: r, parsing, data.table, vectorization

Parsing string from column into data table efficiently in R


I have a data table with a field for counts of people from each country formatted as "XYZ:#" where XYZ is the country code and # the count.

Here is an example of 4 rows from the data table:

library(data.table)

dt <- data.table(event = c("Event 1", "Event 2", "Event 3", "Event 4"),
       desc = c("Desc1", "Desc1", "Desc2", "Desc3"),
       countries = c("USA:433, MEX:132, GRC:58, GBR:50, IRL:35, ITA:20",
                     "ESP:42, DEU:40, ITA:20, SLE:7",
                     "GBR:78, JAM:63, USA:30, AUT:18, GHA:5",
                     "NLD:53, GBR:21, CHN:20"))

The countries field may contain any number of entries; the only structure is that they run from most numerous to least numerous. I need to parse this into a data table for analysis against combinations of event and description. So I need a table showing the count by country code, with rows in the same order so that I can merge it back to the event and description fields.

My solution so far is very inefficient: it loops over each row and parses the countries column, first on the comma separator and then on the colon. Here is what I have so far:

# Reference data for ISO 3-alpha country codes as character vector
nat_codes <- read.csv("countries_codes.csv")[[3]]

# Create empty data table

master_dt <- data.table(matrix(nrow = 0, ncol = length(nat_codes)+1))
names(master_dt) <- c(sort(nat_codes), "Unknown")

for (i in 1:nrow(dt)){
    
    # split each string by the "," separator
    row_component <- strsplit(dt$countries, ",")[[i]]
    
    country_codes <- c()
    numbers <- c()
    
    # Loop through each component and separate country from count
    for (j in 1:length(row_component)){
        
        my_split <- strsplit(row_component[[j]], ":")[[1]]
        
        # Extract the country code, add to vector
        country_codes[j] <- gsub("\\s+", "", my_split[1])
        
        # Extract the number
        numbers[j] <- as.numeric(my_split[2])
    }
    # Re-assemble country codes and numbers as a data table and bind
    row_dt <- data.table(country = country_codes, number = numbers)
    row_dt <- transpose(row_dt)
    names(row_dt) <- as.character(row_dt[1,])
    row_dt <- row_dt[-1,]
    
    master_dt <- rbindlist(list(master_dt, row_dt),
                           fill = TRUE,
                           use.names = TRUE)
    
}
# remove columns with no entries
master_dt <- Filter(function(x) !all(is.na(x)), master_dt)

The output I get from this is correct (although I would need to convert the types), but this approach is too inefficient to scale up.

    USA  MEX  GRC  GBR  IRL  ITA  ESP  DEU  SLE  JAM  AUT  GHA  NLD  CHN
1:  433  132   58   50   35   20 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2: <NA> <NA> <NA> <NA> <NA>   20   42   40    7 <NA> <NA> <NA> <NA> <NA>
3:   30 <NA> <NA>   78 <NA> <NA> <NA> <NA> <NA>   63   18    5 <NA> <NA>
4: <NA> <NA> <NA>   21 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>   53   20

I need to use this on hundreds of thousands of entries, but this is quite slow on a test set of c.1000 rows. What are the easiest wins here to improve performance? Base and data.table solutions preferred.

There is a secondary problem of getting country codes which are not in the reference data, but I can assign those to unknown or disregard those. Thank you.


Solution

  • Using sapply and a unique vector countr of all country codes:

    library(data.table)
    
    countr <- unique(sub(":.*", "", unlist(strsplit(dt$countries, ", "))))
    
    data.table(t(sapply(strsplit(dt$countries, ", "), function(co)
      sapply(countr, function(x) {
        res <- sub(".*:", "", grep(x, co, value = TRUE))
        ifelse(identical(res, character(0)), NA, res)
      }))))

        USA  MEX  GRC  GBR  IRL  ITA  ESP  DEU  SLE  JAM  AUT  GHA  NLD  CHN
    1:  433  132   58   50   35   20 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
    2: <NA> <NA> <NA> <NA> <NA>   20   42   40    7 <NA> <NA> <NA> <NA> <NA>
    3:   30 <NA> <NA>   78 <NA> <NA> <NA> <NA> <NA>   63   18    5 <NA> <NA>
    4: <NA> <NA> <NA>   21 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>   53   20
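Since the sapply result preserves the input row order, the parsed counts can be bound straight back to the event and description columns, which is what the question asks for. A self-contained sketch (the cbind step is my addition, not part of the answer above):

```r
library(data.table)

dt <- data.table(event = c("Event 1", "Event 2", "Event 3", "Event 4"),
                 desc = c("Desc1", "Desc1", "Desc2", "Desc3"),
                 countries = c("USA:433, MEX:132, GRC:58, GBR:50, IRL:35, ITA:20",
                               "ESP:42, DEU:40, ITA:20, SLE:7",
                               "GBR:78, JAM:63, USA:30, AUT:18, GHA:5",
                               "NLD:53, GBR:21, CHN:20"))

# Unique country codes, in first-seen order
countr <- unique(sub(":.*", "", unlist(strsplit(dt$countries, ", "))))

# One row per event, one (character) column per country code
counts <- data.table(t(sapply(strsplit(dt$countries, ", "), function(co)
  sapply(countr, function(x) {
    res <- sub(".*:", "", grep(x, co, value = TRUE))
    ifelse(identical(res, character(0)), NA, res)
  }))))

# Rows come out in the same order as dt, so cbind re-attaches the keys
result <- cbind(dt[, .(event, desc)], counts)
```

Note the counts are still character at this point; wrap them in as.numeric when needed for analysis.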
    

    For speed comparisons, a slightly modified version, operating within the indexed data.table and using lapply:

    countr <- unique(sub(":.*", "", unlist(strsplit(dt$countries, ", "))))
    
    setNames(data.table(t(dt[, lapply(strsplit(countries, ", "), function(co)
      lapply(countr, function(x) {
        res <- sub(".*:", "", grep(x, co, value = TRUE))
        ifelse(identical(res, character(0)), NA, res)
      }))])), countr)

       USA MEX GRC GBR IRL ITA ESP DEU SLE JAM AUT GHA NLD CHN
    1: 433 132  58  50  35  20  NA  NA  NA  NA  NA  NA  NA  NA
    2:  NA  NA  NA  NA  NA  20  42  40   7  NA  NA  NA  NA  NA
    3:  30  NA  NA  78  NA  NA  NA  NA  NA  63  18   5  NA  NA
    4:  NA  NA  NA  21  NA  NA  NA  NA  NA  NA  NA  NA  53  20
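For hundreds of thousands of rows, a long-format reshape may also be worth benchmarking, since it avoids the per-country grep entirely. This is a sketch of an alternative (not part of the answer above) using data.table's tstrsplit and dcast:

```r
library(data.table)

dt <- data.table(event = c("Event 1", "Event 2", "Event 3", "Event 4"),
                 countries = c("USA:433, MEX:132, GRC:58, GBR:50, IRL:35, ITA:20",
                               "ESP:42, DEU:40, ITA:20, SLE:7",
                               "GBR:78, JAM:63, USA:30, AUT:18, GHA:5",
                               "NLD:53, GBR:21, CHN:20"))

# One row per "XYZ:#" token, keyed by event
long <- dt[, .(pair = unlist(strsplit(countries, ", "))), by = event]

# Vectorized split of code and count; type.convert makes the counts numeric
long[, c("country", "n") := tstrsplit(pair, ":", type.convert = TRUE)]

# Back to wide: one row per event, one column per country, NA where absent
wide <- dcast(long, event ~ country, value.var = "n")
```

This keeps the counts numeric from the start, at the cost of one extra reshape; which version wins will depend on the number of distinct country codes relative to the number of rows.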