
Exact matching of a character vector against multiple patterns


I have a hypothetical data.table:

library(data.table)
names <- c("PARQUE NACIONAL VOLCáN ISLUGA", "CARIQUIMA","LASANA","YALQUINCHA","FALDA VOLCAN SAN PEDRO","EL MORRO (PARTICULAR)",
"SANTA ELENA PART.","PATACON (PARTICULAR)(TRABAJO SOCIAL)","MORHUILLA (P)","SAN SEBASTIAN Y OTROS PART", "MORANDE (CONVENIO)",
"LATIGUILLO Y CHACAY (DENTRO)","RINCON LOS PERALES","PERALES BIOBIO","PARTE DEL FUNDO SAN MIGUEL")

dt <- data.table(ID = seq_along(names), names = names)

In addition, I have a vector of string patterns ("patterns"). I want to use it to add a new column ("exact_match") to the data.table "dt" whose value is TRUE or FALSE, depending on whether the column "names" exactly matches any element of "patterns".

patterns <- c("(PARTICULAR)", "PART.","(P)","PART","(CONVENIO)", "(TRABAJO SOCIAL)")

I then tried the following code:

# Create a regular expression pattern that matches any of the given patterns
pattern_regex <- paste0("\\b(?:", paste0(patterns, collapse = "|"), ")\\b")

# Use grepl to check for matches with the pattern_regex for each name
exact_match <- grepl(pattern_regex, dt$names, ignore.case = TRUE)
dt[, exact_match := exact_match]

The new data.table "dt" looks like this:

> dt
    ID                                names exact_match
 1:  1        PARQUE NACIONAL VOLCáN ISLUGA       FALSE
 2:  2                            CARIQUIMA       FALSE
 3:  3                               LASANA       FALSE
 4:  4                           YALQUINCHA       FALSE
 5:  5               FALDA VOLCAN SAN PEDRO       FALSE
 6:  6                EL MORRO (PARTICULAR)        TRUE
 7:  7                    SANTA ELENA PART.        TRUE
 8:  8 PATACON (PARTICULAR)(TRABAJO SOCIAL)        TRUE
 9:  9                        MORHUILLA (P)        TRUE
10: 10           SAN SEBASTIAN Y OTROS PART        TRUE
11: 11                   MORANDE (CONVENIO)        TRUE
12: 12         LATIGUILLO Y CHACAY (DENTRO)       FALSE
13: 13                   RINCON LOS PERALES       FALSE
14: 14                       PERALES BIOBIO       FALSE
15: 15           PARTE DEL FUNDO SAN MIGUEL        TRUE

The code works for almost all values of the column "names". However, it fails for the last row, where exact_match should be FALSE, because the string "PARTE" does not exactly match any element of the vector "patterns".

The expected output for the last row is:

> dt[15, exact_match]
[1] FALSE

Any help would be greatly appreciated.


Solution

  • Although your problem seems to be solved, here is my take on it:

    The pattern

    # \\b around the bare PART keeps it from matching inside longer words such as PARTE
    pattern_regex <- "\\(PARTICULAR\\)|PART\\.|\\(P\\)|\\bPART\\b|\\(CONVENIO\\)|\\(TRABAJO SOCIAL\\)"
    

    matches your needs. Admittedly, it looks complicated, because in an R string literal every regex backslash must itself be escaped with a second backslash. You can then call the standard R function on your column:

    grepl(pattern_regex, dt$names)
    

    The reason was already explained in the comments on your question: the unescaped "." in "PART." is a wildcard, so it matched the E in PARTE. I escaped it so that it matches only a literal period. I also escaped the round brackets so that they are matched literally instead of being treated as grouping.
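
The effect of the escapes described above can be checked on a few short strings. This is only a minimal sketch in base R; the variable names (escaped_dot, unescaped_hits and so on) are made up for illustration and are not part of the original code.

```r
# Each "\\" in R source code is a single backslash in the actual string,
# so the regex engine receives PART\. here.
escaped_dot <- "PART\\."
cat(escaped_dot, "\n")

# Unescaped dot: acts as a wildcard, so "PARTE" matches too.
unescaped_hits <- grepl("PART.", c("PARTE", "PART."))

# Escaped dot: only a literal period matches.
escaped_hits <- grepl("PART\\.", c("PARTE", "PART."))

# Unescaped parentheses form a group, so "(P)" behaves like plain "P".
group_hits <- grepl("(P)", c("MORHUILLA (P)", "LOS PERALES"))

# Escaped parentheses are matched literally.
literal_hits <- grepl("\\(P\\)", c("MORHUILLA (P)", "LOS PERALES"))

print(unescaped_hits)  # TRUE TRUE
print(escaped_hits)    # FALSE TRUE
print(group_hits)      # TRUE TRUE
print(literal_hits)    # TRUE FALSE
```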

    Long story short, I strongly recommend using the package {regexplain} by Garrick Aden-Buie (https://www.garrickadenbuie.com/project/regexplain/); with it you get immediate visual feedback of your matches, using the function

    regexplain::view_regex(dt$names, pattern_regex)
    

    You can then start the addin to develop your regex further:

    regexplain:::regexplain_addin()
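
As an alternative to writing the escapes by hand, the pattern could also be built from the "patterns" vector itself. The following is a sketch under some assumptions, not the method used in this answer: escape_regex is a hypothetical helper (it is not part of base R or data.table), and the lookarounds (?<!\\w) and (?!\\w) are one possible way to require that a match is not glued to a neighbouring word character, which needs perl = TRUE. For brevity it uses a plain character vector with a few of the names from the question instead of the full data.table.

```r
patterns <- c("(PARTICULAR)", "PART.", "(P)", "PART", "(CONVENIO)", "(TRABAJO SOCIAL)")

# Hypothetical helper: put a backslash in front of every regex metacharacter
# so that each pattern is matched literally.
escape_regex <- function(x) {
  gsub("([\\\\.^$|()\\[\\]{}*+?])", "\\\\\\1", x, perl = TRUE)
}

# Join the escaped patterns into one alternation. The lookarounds reject a
# match that is directly preceded or followed by a word character.
pattern_regex <- paste0(
  "(?<!\\w)(?:",
  paste(escape_regex(patterns), collapse = "|"),
  ")(?!\\w)"
)

# A few of the names from the question, used as sample input.
sample_names <- c(
  "EL MORRO (PARTICULAR)",
  "SANTA ELENA PART.",
  "MORHUILLA (P)",
  "SAN SEBASTIAN Y OTROS PART",
  "PARTE DEL FUNDO SAN MIGUEL",
  "RINCON LOS PERALES"
)

exact_match <- grepl(pattern_regex, sample_names, perl = TRUE)
print(exact_match)  # TRUE TRUE TRUE TRUE FALSE FALSE
```

With the data.table from the question, the same call could be used as dt[, exact_match := grepl(pattern_regex, names, perl = TRUE)].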