I am parsing a table from a PDF and trying to clean up the OCR output. I want to use str_remove_all()
to strip some common OCR failures. I've written a regular expression that matches my strings, but when I pass it to str_remove_all()
nothing is removed. See the code below:
> regexpattern <- '^(\\s*[[:punct:]]*\\s*)+|^\\s*\\d{1,2}\\s*$|^\\s+$|(^\\s*[[:ascii:]]{1}\\s*$)|(^\\s*[^[:ascii:]]{1}\\s*$)'
> strg <- "Ú"
> grep(regexpattern,strg, perl = T)
[1] 1
> str_remove_all(strg,regexpattern)
[1] "Ú"
Any ideas as to why my str_remove_all()
is failing? Thanks!
The problem is that your regex isn't doing what I think you intend it to do. The first alternative in your regex is ^(\\s*[[:punct:]]*\\s*)+
. Every element in it is optional, so it can match the empty string at the start of any input. That zero-length match always succeeds, so none of your other alternatives are ever tried. This is why grep reports a match, while stringr::str_remove_all
removes only the empty match and leaves the string unchanged. Change the first alternative to ^(\\s*[[:punct:]]+\\s*)+
so that it requires at least one punctuation character, and you'll have better success.
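To see the zero-length match directly, here is a minimal sketch. It assumes stringr is installed; the variable names zero_width and needs_punct are mine, not from the question.

```r
library(stringr)

# "Ú", written as an escape so the file encoding does not matter
strg <- "\u00da"

# First alternative from the question: every piece is optional,
# so it can succeed without consuming any characters.
zero_width <- "^(\\s*[[:punct:]]*\\s*)+"

# Same alternative with `+` on the punctuation class,
# so at least one punctuation character is required.
needs_punct <- "^(\\s*[[:punct:]]+\\s*)+"

print(str_detect(strg, zero_width))   # is a match reported?
print(str_extract(strg, zero_width))  # what text did it actually match?
print(str_detect(strg, needs_punct))  # does the stricter version still match?
```

If the first call reports a match while the extracted text is empty, the pattern matched without consuming anything, which is why nothing gets removed.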
> regexpattern <- '^(\\s*[[:punct:]]*\\s*)+'
> grep(regexpattern,strg, perl = T, value=TRUE)
[1] "Ú"
> regexpattern <- '^(\\s*[[:punct:]]+\\s*)+'
> grep(regexpattern,strg, perl = T, value=TRUE)
character(0)
> regexpattern <- '^(\\s*[[:punct:]]+\\s*)+|^\\s*\\d{1,2}\\s*$|^\\s+$|(^\\s*[[:ascii:]]{1}\\s*$)|(^\\s*[^[[:ascii:]]]{1}\\s*$)'
> str_remove_all(strg, regexpattern)
[1] ""
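As a rough check, the corrected pattern can be run over a few sample strings. The sample values below are made up for illustration and are not from the actual OCR output; the pattern is copied from the session above.

```r
library(stringr)

# Corrected pattern, copied from the session above.
regexpattern <- '^(\\s*[[:punct:]]+\\s*)+|^\\s*\\d{1,2}\\s*$|^\\s+$|(^\\s*[[:ascii:]]{1}\\s*$)|(^\\s*[^[[:ascii:]]]{1}\\s*$)'

# Hypothetical OCR cells: a lone non-ASCII character, a stray number,
# leading punctuation before real text, and a clean value.
samples <- c("\u00da", " 12 ", "-- Total", "Revenue 2020")

cleaned <- str_remove_all(samples, regexpattern)
print(cleaned)
```

Under those assumptions I would expect the lone character and the stray number to be removed entirely, the leading dashes to be stripped from the third value, and the last value to be left alone.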