I have a data table with a field for counts of people from each country formatted as "XYZ:#" where XYZ is the country code and # the count.
Here is an example of 4 rows from the data table:
dt <-data.table(event = c("Event 1", "Event 2", "Event 3", "Event 4"),
desc = c("Desc1", "Desc1", "Desc2", "Desc3"),
countries = c("USA:433, MEX:132, GRC:58, GBR:50, IRL:35, ITA:20",
"ESP:42, DEU:40, ITA:20, SLE:7",
"GBR:78, JAM:63, USA:30, AUT:18, GHA:5",
"NLD:53, GBR:21, CHN:20"))
The countries field may contain any number of entries, with the only structure being that it starts from most numerous to least numerous. I need to parse this into a data table for analysis against combinations of event and descriptions. So I need a table showing the count by country code, with rows in the same order so that I can merge it back to the event and description fields.
My solution so far is very inefficient, it loops over each row and then parses the countries column first for the comma separator then for the colon. Here is what I have so far:
# Reference data for ISO 3-alpha country codes as character vector
nat_codes <- read.csv("countries_codes.csv")[[3]]
# Create empty data table
master_dt <- data.table(matrix(nrow = 0, ncol = length(nat_codes)+1))
names(master_dt) <- c(sort(nat_codes), "Unknown")
for (i in 1:nrow(dt)){
# split each string by the "," separator
row_component <- strsplit(dt$countries, ",")[[i]]
country_codes <- c()
numbers <- c()
# Loop through each component and separate country from count
for (j in 1:length(row_component)){
my_split <- strsplit(row_component[[j]], ":")[[1]]
# Extract the country code, add to vector
country_codes[j] <- gsub("\\s+", "", my_split[1])
# Extract the number
numbers[j] <- as.numeric(my_split[2])
}
# Re-assemble country codes and numbers as a data table and bind
row_dt <- data.table(country = country_codes, number = numbers)
row_dt <- transpose(row_dt)
names(row_dt) <- as.character(row_dt[1,])
row_dt <- row_dt[-1,]
master_dt <- rbindlist(list(master_dt, row_dt),
fill = TRUE,
use.names = TRUE)
}
# remove rows with no entries
master_dt <- Filter(function(x)!all(is.na(x)), master_dt)
The output I get from this is correct (although i would need to convert the type) but this approach is too inefficient to scale up.
USA MEX GRC GBR IRL ITA ESP DEU SLE JAM AUT GHA NLD CHN
1: 433 132 58 50 35 20 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2: <NA> <NA> <NA> <NA> <NA> 20 42 40 7 <NA> <NA> <NA> <NA> <NA>
3: 30 <NA> <NA> 78 <NA> <NA> <NA> <NA> <NA> 63 18 5 <NA> <NA>
4: <NA> <NA> <NA> 21 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 53 20
I need to use this on hundreds of thousands of entries, but this is quite slow on a test set of c.1000 rows. What are the easiest wins here to improve performance? Base and data.table solutions preferred.
There is a secondary problem of getting country codes which are not in the reference data, but I can assign those to unknown or disregard those. Thank you.
Using sapply
and a unique
list countr of all countries.
library(data.table)
countr <- unique(sub(":.*", "", unlist(strsplit(dt$countries, ", "))))
data.table(t(sapply(strsplit(dt$countries, ", "), function(co)
sapply(countr, function(x){
res <- sub(".*:", "", grep(x, co, value=T))
ifelse(identical(res, character(0)), NA, res)}))))
USA MEX GRC GBR IRL ITA ESP DEU SLE JAM AUT GHA NLD CHN
1: 433 132 58 50 35 20 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2: <NA> <NA> <NA> <NA> <NA> 20 42 40 7 <NA> <NA> <NA> <NA> <NA>
3: 30 <NA> <NA> 78 <NA> <NA> <NA> <NA> <NA> 63 18 5 <NA> <NA>
4: <NA> <NA> <NA> 21 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 53 20
For speed comparisons a slightly modified version, operating within the indexed data.table
and using lapply
countr <- unique(sub(":.*", "", unlist(strsplit(dt$countries, ", "))))
setNames(data.table(t(dt[, lapply(strsplit(countries, ", "), function(co)
lapply(countr, function(x){
res <- sub(".*:", "", grep(x, co, value=T))
ifelse(identical(res, character(0)), NA, res)}))])), countr)
USA MEX GRC GBR IRL ITA ESP DEU SLE JAM AUT GHA NLD CHN
1: 433 132 58 50 35 20 NA NA NA NA NA NA NA NA
2: NA NA NA NA NA 20 42 40 7 NA NA NA NA NA
3: 30 NA NA 78 NA NA NA NA NA 63 18 5 NA NA
4: NA NA NA 21 NA NA NA NA NA NA NA NA 53 20