Search code examples
rfunctiondata.tablegsubvarying

how to apply varying gsub pattern (variable function) to each row of data.table in R


I've got a data.table DT with a string column and a numeric column that indicates how many words from the start of the string should be extracted.

    > require(data.table)
    > DT <- data.table(string_col = c("A BB CCC", "DD EEE FFFF GDG", "AB DFD EFGD ABC DBC", "ABC DEF") 
                     , first_n_words = c(2, 3, 3, 1))
    > DT
                string_col first_n_words
    1:            A BB CCC             2
    2:     DD EEE FFFF GDG             3
    3: AB DFD EFGD ABC DBC             3
    4:             ABC DEF             1

I'd like to add a new column with the first-n-words of the string_col, as following:

> output_DT
            string_col first_n_words output_string_col
1:            A BB CCC             2              A BB
2:     DD EEE FFFF GDG             3       DD EEE FFFF
3: AB DFD EFGD ABC DBC             3       AB DFD EFGD
4:             ABC DEF             1               ABC

This is the gsub syntax that can be used:

> gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),"\\1", string_col)

I basically need to create this gsub function for every row, using first_n_words of that row before applying it to string_col of that row. I'm only interested in a data.table syntax solution as it's a very large data set. a gsub solution would be most desired.


Edit: I've tried the following and it doesn't work

> DT[, output_string_col := gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),"\\1", string_col)]
Warning message:
In gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),  :
  argument 'pattern' has length > 1 and only the first element will be used
>## This is not the desired output    
> DT 
                string_col first_n_words output_string_col
    1:            A BB CCC             2              A BB
    2:     DD EEE FFFF GDG             3            DD EEE
    3: AB DFD EFGD ABC DBC             3            AB DFD
    4:             ABC DEF             1           ABC DEF

This is not the desired output


Solution

  • An answer to keep your use of data.table is to use a grouping operation, as you need a value in gsub, not a vector:

    DT[,line := .I]
    DT[, output_string_col := gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),"\\1", string_col),by = line]
    
    > DT
                string_col first_n_words line output_string_col
    1:            A BB CCC             2    1              A BB
    2:     DD EEE FFFF GDG             3    2       DD EEE FFFF
    3: AB DFD EFGD ABC DBC             3    3       AB DFD EFGD
    4:             ABC DEF             1    4               ABC
    

    Edit

    As @Franck remarqued the grouping should be on first_n_wordsto be more efficient

    DT[, output_string_col := gsub(paste0("^((\\w+\\W+){", first_n_words[1] - 1, "}\\w+).*$"),"\\1", string_col),by = first_n_words]
    

    the benchmark with this modified version gives faster results :

    library(microbenchmark)
    
    denis <- function(x){
      x[, output_string_col := gsub(paste0("^((\\w+\\W+){", first_n_words[1] - 1, "}\\w+).*$"),"\\1", string_col),by = first_n_words]
    }
    
    
    
    Tim <- function(x){
      x[, output_string_col := apply(x, 1, function(x) {
        gsub(paste0("^((\\w+\\W+){", as.numeric(x[2]) - 1, "}\\w+).*$"), "\\1", x[1])
      })]
    }
    
    miss <- function(x){
      x[, output_string_col := stringr::word(string_col, end = first_n_words)]
    }
    
    DT <- DT[sample(1:4, 1000, replace = TRUE),]
    
    microbenchmark(
      Tim(DT),
      miss(DT),
      denis(DT)
    )
    
    Unit: milliseconds
          expr       min        lq      mean    median        uq
       Tim(DT) 56.851716 57.836126 60.435164 58.714486 60.753051
      miss(DT) 11.042056 11.516928 12.427029 11.871800 12.617031
     denis(DT)  1.993437  2.355283  2.555936  2.615181  2.680001
            max neval
     111.169277   100
      20.916932   100
       3.530668   100