I have a hypothetical data.table:
library(data.table)
names <- c("PARQUE NACIONAL VOLCáN ISLUGA", "CARIQUIMA","LASANA","YALQUINCHA","FALDA VOLCAN SAN PEDRO","EL MORRO (PARTICULAR)",
"SANTA ELENA PART.","PATACON (PARTICULAR)(TRABAJO SOCIAL)","MORHUILLA (P)","SAN SEBASTIAN Y OTROS PART", "MORANDE (CONVENIO)",
"LATIGUILLO Y CHACAY (DENTRO)","RINCON LOS PERALES","PERALES BIOBIO","PARTE DEL FUNDO SAN MIGUEL")
dt <- data.table(ID = seq_along(names), names = names)
In addition, I have a vector of string patterns (“patterns”). I want to use it to create a new column (“exact_match”) in the data.table “dt” that is TRUE when the value in column “names” contains an exact match of any element of “patterns”, and FALSE otherwise.
patterns <- c("(PARTICULAR)", "PART.","(P)","PART","(CONVENIO)", "(TRABAJO SOCIAL)")
I tried the following code:
# Create a regular expression pattern that matches any of the given patterns
pattern_regex <- paste0("\\b(?:", paste0(patterns, collapse = "|"), ")\\b")
# Use grepl to check for matches with the pattern_regex for each name
exact_match <- grepl(pattern_regex, dt$names, ignore.case = TRUE)
dt[, exact_match := exact_match]
The resulting data.table “dt” looks like this:
> dt
    ID                                names exact_match
 1:  1        PARQUE NACIONAL VOLCáN ISLUGA       FALSE
 2:  2                            CARIQUIMA       FALSE
 3:  3                               LASANA       FALSE
 4:  4                           YALQUINCHA       FALSE
 5:  5               FALDA VOLCAN SAN PEDRO       FALSE
 6:  6                EL MORRO (PARTICULAR)        TRUE
 7:  7                    SANTA ELENA PART.        TRUE
 8:  8 PATACON (PARTICULAR)(TRABAJO SOCIAL)        TRUE
 9:  9                        MORHUILLA (P)        TRUE
10: 10           SAN SEBASTIAN Y OTROS PART        TRUE
11: 11                   MORANDE (CONVENIO)        TRUE
12: 12         LATIGUILLO Y CHACAY (DENTRO)       FALSE
13: 13                   RINCON LOS PERALES       FALSE
14: 14                       PERALES BIOBIO       FALSE
15: 15           PARTE DEL FUNDO SAN MIGUEL        TRUE
The code works for almost all values of column “names”. However, it fails for the last row, where exact_match should be FALSE, because the string “PARTE” does not exactly match any element of the vector “patterns”.
The expected output for the last row is:
> dt[15, exact_match]
[1] FALSE
Any help would be greatly appreciated.
Although your problem seems to be fixed, here is my take on it:
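The pattern
pattern_regex <- "\\(PARTICULAR\\)|PART\\.|\\(P\\)|\\bPART\\b|\\(CONVENIO\\)|\\(TRABAJO SOCIAL\\)"
matches your needs. The bare PART alternative is wrapped in \\b (word boundaries), because without them it would still match the first four letters of PARTE. Admittedly it looks very complicated, because in an R string literal every regex backslash has to be doubled. But then you can just call the standard R function on your column: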
grepl(pattern_regex, dt$names)
The reason was already explained in the comment to your question: the unescaped . in PART. is a wildcard, so it matched the E in PARTE. I therefore escaped it so that it matches only a literal period. I also escaped the round brackets so that they match literal parentheses instead of forming groups.
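To make the effect of escaping concrete, here is a small sketch in base R. The variable names are only for illustration; the strings are taken from the question.

```r
# Unescaped: "." is a wildcard, so "PART." also matches "PARTE".
dot_unescaped <- grepl("PART.", "PARTE DEL FUNDO SAN MIGUEL")

# Escaped: "\\." matches only a literal period.
dot_escaped_parte <- grepl("PART\\.", "PARTE DEL FUNDO SAN MIGUEL")
dot_escaped_part  <- grepl("PART\\.", "SANTA ELENA PART.")

# Unescaped parentheses form a group, so "(P)" matches any "P".
paren_unescaped <- grepl("(P)", "PERALES BIOBIO")

# Escaped parentheses match the literal characters "(" and ")".
paren_escaped_perales   <- grepl("\\(P\\)", "PERALES BIOBIO")
paren_escaped_morhuilla <- grepl("\\(P\\)", "MORHUILLA (P)")

print(c(dot_unescaped, dot_escaped_parte, dot_escaped_part))
print(c(paren_unescaped, paren_escaped_perales, paren_escaped_morhuilla))
```

The first print shows that only the escaped version rejects PARTE, and the second shows the same effect for the parentheses.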
Long story short, I strongly recommend the package {regexplain} by Garrick Aden-Buie (https://www.garrickadenbuie.com/project/regexplain/). It gives you immediate visual feedback on your matches through the function
regexplain::view_regex(dt$names, pattern_regex)
Then start the addin to develop your regex further:
regexplain:::regexplain_addin()
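If the vector of patterns changes often, escaping by hand gets tedious. Below is a minimal sketch of one way to build the regex from “patterns” automatically. Note the assumptions: escape_regex is a hypothetical helper I wrote for this example (it is not part of base R or data.table), and I am interpreting "exact match" as "the pattern is not directly attached to a letter or digit on either side", which I express with lookarounds and perl = TRUE. The sketch uses a plain character vector so that it runs without any packages.

```r
# Hypothetical helper: put a backslash in front of every regex metacharacter
# so that each pattern is matched literally.
escape_regex <- function(x) {
  gsub("([\\\\.^$|()\\[\\]{}*+?])", "\\\\\\1", x, perl = TRUE)
}

patterns <- c("(PARTICULAR)", "PART.", "(P)", "PART", "(CONVENIO)", "(TRABAJO SOCIAL)")

# Assumption: a match must not touch a letter or digit on either side.
pattern_regex <- paste0(
  "(?<![[:alnum:]])(?:",
  paste(escape_regex(patterns), collapse = "|"),
  ")(?![[:alnum:]])"
)

# A few of the values from the question, as a plain character vector.
x <- c("EL MORRO (PARTICULAR)", "SANTA ELENA PART.", "MORHUILLA (P)",
       "SAN SEBASTIAN Y OTROS PART", "PERALES BIOBIO",
       "PARTE DEL FUNDO SAN MIGUEL")

res <- grepl(pattern_regex, x, perl = TRUE)
print(data.frame(x = x, exact_match = res))

# On the data.table from the question, the same call would be:
# dt[, exact_match := grepl(pattern_regex, names, perl = TRUE)]
```

With this interpretation the first four strings match and the last two do not, so PARTE DEL FUNDO SAN MIGUEL comes out as FALSE. If your notion of "exact" is different (for example, the pattern must be the whole string), the two lookarounds are the only part you would need to change.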