Tags: r, regex, dplyr, sparklyr

sparklyr, dplyr, regex: extract a pattern from a text variable, then separate the matches with semicolons


I'm using sparklyr and dplyr, and I've been trying to create a variable, extract_code, that extracts a certain pattern from a text variable. The pattern is 3 letters followed by 3 digits. It can appear several times in the same text; in that case I'd like the matches to be separated by semicolons.

I have created this regex pattern:

regex_pattern <- "[A-Za-z]{3}[0-9]{3}"

Here's what I have:

test <-  data.table(id = 1:3, text= c("(table 012 APM325)", "(JUI524 toto KIO879)" , "(pink car in the field KJU547 MPO362/JHY879)"))

Here's what I'd like to have:

test <-   data.table(id = 1:3, text= c("(table 012 APM325)", "(JUI524 toto KIO879)" , "(pink car in the field KJU547 MPO362/JHY879)"), extract_code =c( "APM325", "JUI524;KIO879" , "KJU547;MPO362;JHY879"))

I've tried this:

test <- test %>%  mutate(extract_code = regexp_extract(text, regex_pattern, 0))

This returns the equivalent of:

data.table(id = 1:3, text= c("(table 012 APM325)", "(JUI524 toto KIO879)" , "(pink car in the field KJU547 MPO362/JHY879)"), extract_code =c( "APM325", "JUI524" , "KJU547"))

But I only get the first pattern.

Do you have any tips? Thank you!

EDIT: THIS WORKS!

try <-  data.table(id = 1:3, text= c("(table 012 APM325)", "(JUI524 toto KIO879)" , "(pink car in the field KJU547 MPO362/JHY879)"))

sdf_try <- copy_to(sc, try , "try" )

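# Function factory: returns a function that adds extract_code to a data frame
extract.pattern <- function(pat) function(df) {
  # Collect every match of pat in each string and join them with ";"
  f <- function(vec) sapply(regmatches(vec, gregexpr(pat, vec)), paste0, collapse = ";")
  dplyr::mutate(df, extract_code = f(text))
}

sdf_try %>%
  spark_apply(extract.pattern("[A-Z]{3}[0-9]{3}"))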

But this does not work:

regex_pattern <- "[A-Z]{3}[0-9]{3}"


sdf_try %>%
   spark_apply(extract.pattern(regex_pattern))

# Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 Exception: sparklyr worker rscript failure with status 255, check worker logs for details.


sdf_try %>%
   spark_apply(extract.pattern('regex_pattern'))
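
A possible explanation for the failure with the variable, offered as an unverified assumption rather than a confirmed diagnosis: R evaluates function arguments lazily, so `pat` inside `extract.pattern(regex_pattern)` may still be an unevaluated reference to `regex_pattern` when the closure is shipped to the workers, where that variable does not exist. The sketch below reproduces the effect in plain R, without Spark. `make_lazy` and `make_forced` are hypothetical names, and `rm()` merely stands in for a worker session that lacks the variable.

```r
# Lazy version: `pat` is still an unevaluated promise when the closure is returned.
make_lazy <- function(pat) {
  function(vec) sapply(regmatches(vec, gregexpr(pat, vec)), paste0, collapse = ";")
}

# Forced version: force() evaluates `pat` immediately, so the closure
# carries the actual string instead of a reference to `regex_pattern`.
make_forced <- function(pat) {
  force(pat)
  function(vec) sapply(regmatches(vec, gregexpr(pat, vec)), paste0, collapse = ";")
}

regex_pattern <- "[A-Z]{3}[0-9]{3}"
lazy   <- make_lazy(regex_pattern)
forced <- make_forced(regex_pattern)

# Assumption: this imitates a worker where `regex_pattern` is not defined.
rm(regex_pattern)

forced_result <- forced("(JUI524 toto KIO879)")

# The lazy closure only looks up `regex_pattern` now, and fails.
lazy_result <- tryCatch(
  lazy("(JUI524 toto KIO879)"),
  error = function(e) conditionMessage(e)
)

print(forced_result)
print(lazy_result)
```

If this is indeed the cause, adding `force(pat)` as the first line of `extract.pattern` might be enough, but that has not been tested on a cluster here.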

Solution

regex_pattern <- "[A-Z]{3}[0-9]{3}"
test %>%  mutate(extract_code = sapply(regmatches(text, gregexpr(regex_pattern,text)), paste0, collapse = ";"))

#  id                                         text         extract_code
#1  1                           (table 012 APM325)               APM325
#2  2                         (JUI524 toto KIO879)        JUI524;KIO879
#3  3 (pink car in the field KJU547 MPO362/JHY879) KJU547;MPO362;JHY879


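  • I've changed [A-Za-z] to [A-Z]. Change it back if that does not suit your data; it works for the example.

  • regmatches returns a list with one character vector of matches per input string. sapply with paste0(collapse = ";") then collapses each vector into a single string separated by semicolons.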
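
The two steps described above can be sketched separately in base R, on a local character vector rather than a Spark table. The fourth string is a hypothetical extra row, added only to show what happens when nothing matches; `vapply` is used here in place of `sapply` so the result is always a character vector.

```r
regex_pattern <- "[A-Z]{3}[0-9]{3}"

text <- c("(table 012 APM325)",
          "(JUI524 toto KIO879)",
          "(pink car in the field KJU547 MPO362/JHY879)",
          "no code here")  # hypothetical row with no match

# Step 1: gregexpr() locates every match in each string and
# regmatches() extracts them, giving a list with one character
# vector per input string.
matches <- regmatches(text, gregexpr(regex_pattern, text))

# Step 2: collapse each vector into one semicolon-separated string.
# A string without matches yields character(0), which collapses to "".
extract_code <- vapply(matches, paste0, character(1), collapse = ";")

print(extract_code)
```

Rows without any match end up as an empty string rather than NA, which may or may not be what you want downstream.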