My goal is to identify whether a given text contains a target string, but I want to allow for typos / small deviations and extract the substring that "caused" the match (to use it for further text analysis).
Example:
target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."
Desired output:
I would like to have target strlng as the output, since it is very close to the target (Levenshtein distance of 1). Next, I want to use target strlng to extract the word Butter (this part I have covered; I only mention it to give a detailed spec).
What i tried:
Using adist did not work, since it compares two strings, not substrings.
Next I took a look at agrep, which seems very close. I can get the output that my target was found, but not the substring that "caused" the match.
I tried value = TRUE, but it seems to work at the array level (it returns whole elements). I don't think I can switch to an array type, because I cannot split by spaces (my target string might contain spaces, ...).
agrep(
  pattern = target,
  x = text,
  value = TRUE
)
Use aregexec. It is used much like regexpr/regmatches (or gregexpr) for extracting exact matches.
m <- aregexec('string', 'text strlng wrong')
regmatches('text strlng wrong', m)
#[[1]]
#[1] "strlng"
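How tolerant the match is can be tuned with the max.distance argument of aregexec, which defaults to 0.1 (a fraction of the pattern length). The following is only a minimal sketch; the variable names x, m_default and m_exact are made up for illustration, and the right tolerance depends on your data.

```r
x <- "text strlng wrong"

# Default tolerance (max.distance = 0.1): for a 6-character pattern
# this allows one edit, so "strlng" matches "string"
m_default <- aregexec("string", x)
regmatches(x, m_default)

# max.distance = 0 asks for an exact match only, so the typo
# is expected to yield no match here
m_exact <- aregexec("string", x, max.distance = 0)
regmatches(x, m_exact)
```

A larger max.distance accepts more typos but also raises the chance of matching unrelated substrings.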
This can be wrapped in a function that accepts the arguments of both aregexec and regmatches. Note that in the latter case, the function argument invert comes after the dots argument ..., so it must be passed as a named argument.
aregextract <- function(pattern, text, ..., invert = FALSE) {
  m <- aregexec(pattern, text, ...)
  regmatches(text, m, invert = invert)
}
aregextract(target, text)
#[[1]]
#[1] "target strlng"
aregextract(target, text, invert = TRUE)
#[[1]]
#[1] "the "
#[2] ": Butter. this text i dont want to extract."
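If you need the position of the match rather than the substring itself (for example to pick up the word that follows it), the match data returned by aregexec carries the start position and a "match.length" attribute. A rough sketch under the assumption that "the next word" means the first run of letters after the match; the names rest and next_word and that rule are illustrative, not part of the question:

```r
target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."

m <- aregexec(target, text)
start <- m[[1]][1]                        # 1-based start of the approximate match
len <- attr(m[[1]], "match.length")[1]    # length of the matched substring

# Everything after the matched substring
rest <- substring(text, start + len)

# First run of letters in the remainder (assumed rule for "the next word")
next_word <- regmatches(rest, regexpr("[[:alpha:]]+", rest))
next_word
```

When there is no approximate match, m[[1]] is -1, so a real script should check for that before computing rest.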