Example DF:
country <- c('Australia', 'Italy', 'Peru', 'China')
score <- c("0.091", "0.413,.", "-", "0.102,0.102,0.102,.,.,.,.,.,.,.,.")
country_scores <- data.frame(country, score)
Each entry of score can have any number of comma separated values or a "-" for no data. I am looking to extract the largest value in the string and test whether it meets a certain threshold. I tried a solution from https://stackoverflow.com/a/65121200/8621123 but the solution is very slow (at least 8 minutes) on my data frame of 1.3 million rows and 186 columns:
library(tidyverse)
country_scores %>%
mutate(scores = str_extract_all(score, '\\d+(\\.\\d+)?'),
score_max = map_dbl(new, ~max(as.numeric(.x))))
See if this data.table
approach works for your data
library(data.table)
setDT(country_scores)
country_scores[, max_score := sapply(strsplit(score, ", *"), \(x)
last(sort(as.numeric(x[grepl("^\\d+\\.*\\d+$", x)])))[1])]
output
country score max_score
<char> <char> <num>
1: Australia 0.091 0.091
2: Italy 0.413,. 0.413
3: Peru - NA
4: China 0.102,0.102,0.102,.,.,.,.,.,.,.,. 0.102
Or with max
,a little faster, but needs to suppressWarnings
and puts Inf per default.
suppressWarnings(
country_scores[, max_score := sapply(strsplit(score, ", *"), \(x)
max(as.numeric(x[grepl("^\\d+\\.*\\d+$", x)])))])
Note that you can always turn the data back to a base dataframe with setDF(country_scores)
All single core, since on Intel Mac. data.table
will profit from multi core systems and provide more speed up on Linux and Windows (or OpenMP-enabled Mac). country_scores extended to 100,000 all unique lines.
Unit: seconds
expr min lq mean median uq max
dplyR_orig 31.888486 32.514975 33.326635 33.351345 34.236091 34.74258
dplyR_slicemax 43.597723 44.021222 45.348425 45.635679 46.019922 47.77643
data.table_sort 17.008924 17.382891 17.658448 17.589783 17.845170 18.49448
data.table_max 7.556219 7.585715 7.648081 7.631336 7.662826 7.84545
neval cld
10 a
10 b
10 c
10 d
library(microbenchmark)
microbenchmark(
dplyR_orig = {suppressWarnings(country_scores %>%
mutate(scores = str_extract_all(score, '\\d+(\\.\\d+)?'),
score_max = purrr::map_dbl(scores, ~max(as.numeric(.x)))))},
dplyR_slicemax = {suppressWarnings(country_scores %>%
separate_longer_delim(score, ",") %>%
slice_max(as.numeric(score), with_ties = F, by = country))},
data.table_sort = {setDT(country_scores);
country_scores[, max_score := sapply(strsplit(score, ", *"), \(x)
data.table::last(sort(as.numeric(x[grepl("^\\d+\\.*\\d+$", x)])))[1])]},
data.table_max = {setDT(country_scores);
suppressWarnings(country_scores[, max_score := sapply(strsplit(score, ", *"), \(x)
max(as.numeric(x[grepl("^\\d+\\.*\\d+$", x)])))])}, times=10)