Search code examples
rdataframeparsingtidyrpurrr

R: More efficent way to create a new data frame column containing the maximum value from a string


Example DF:

country <- c('Australia', 'Italy', 'Peru', 'China')
score <- c("0.091", "0.413,.", "-", "0.102,0.102,0.102,.,.,.,.,.,.,.,.")
country_scores <- data.frame(country, score)

Each entry of score can have any number of comma separated values or a "-" for no data. I am looking to extract the largest value in the string and test whether it meets a certain threshold. I tried a solution from https://stackoverflow.com/a/65121200/8621123 but the solution is very slow (at least 8 minutes) on my data frame of 1.3 million rows and 186 columns:

library(tidyverse)

country_scores %>%
  mutate(scores = str_extract_all(score, '\\d+(\\.\\d+)?'), 
         score_max = map_dbl(new, ~max(as.numeric(.x))))

Solution

  • See if this data.table approach works for your data

    library(data.table)
    
    setDT(country_scores)
    
    country_scores[, max_score := sapply(strsplit(score, ", *"), \(x) 
      last(sort(as.numeric(x[grepl("^\\d+\\.*\\d+$", x)])))[1])]
    

    output

         country                             score max_score
          <char>                            <char>     <num>
    1: Australia                             0.091     0.091
    2:     Italy                           0.413,.     0.413
    3:      Peru                                 -        NA
    4:     China 0.102,0.102,0.102,.,.,.,.,.,.,.,.     0.102
    

    Or with max,a little faster, but needs to suppressWarnings and puts Inf per default.

    suppressWarnings(
      country_scores[, max_score := sapply(strsplit(score, ", *"), \(x) 
        max(as.numeric(x[grepl("^\\d+\\.*\\d+$", x)])))])
    

    Note that you can always turn the data back to a base dataframe with setDF(country_scores)

    A little benchmark

    All single core, since on Intel Mac. data.table will profit from multi core systems and provide more speed up on Linux and Windows (or OpenMP-enabled Mac). country_scores extended to 100,000 all unique lines.

    Unit: seconds
                expr       min        lq      mean    median        uq      max
          dplyR_orig 31.888486 32.514975 33.326635 33.351345 34.236091 34.74258
      dplyR_slicemax 43.597723 44.021222 45.348425 45.635679 46.019922 47.77643
     data.table_sort 17.008924 17.382891 17.658448 17.589783 17.845170 18.49448
      data.table_max  7.556219  7.585715  7.648081  7.631336  7.662826  7.84545
     neval  cld
        10 a   
        10  b  
        10   c 
        10    d
    

    benchmark plot

    library(microbenchmark)
    
    microbenchmark(
      dplyR_orig = {suppressWarnings(country_scores %>% 
           mutate(scores = str_extract_all(score, '\\d+(\\.\\d+)?'), 
                  score_max = purrr::map_dbl(scores, ~max(as.numeric(.x)))))},
      dplyR_slicemax = {suppressWarnings(country_scores %>% 
           separate_longer_delim(score, ",") %>% 
           slice_max(as.numeric(score), with_ties = F, by = country))},
      data.table_sort = {setDT(country_scores); 
         country_scores[, max_score := sapply(strsplit(score, ", *"), \(x) 
           data.table::last(sort(as.numeric(x[grepl("^\\d+\\.*\\d+$", x)])))[1])]},
      data.table_max = {setDT(country_scores); 
         suppressWarnings(country_scores[, max_score := sapply(strsplit(score, ", *"), \(x) 
           max(as.numeric(x[grepl("^\\d+\\.*\\d+$", x)])))])}, times=10)