
R - grep() matches, but str_remove_all() fails with non-ascii characters


I am parsing a table from a PDF and trying to clean up my readings. I want to use str_remove_all() to strip some common failures in my OCR. I've written a regular expression that matches my strings, but when I pass it to str_remove_all(), nothing is removed. See the code below:

> regexpattern <- '^(\\s*[[:punct:]]*\\s*)+|^\\s*\\d{1,2}\\s*$|^\\s+$|(^\\s* 
[[:ascii:]]{1}\\s*$)|(^\\s*[^[:ascii:]]{1}\\s*$)'

>   strg <- "Ú"

>  grep(regexpattern,strg, perl = T)
[1] 1

>  str_remove_all(strg,regexpattern)
[1] "Ú"

Any ideas as to why my str_remove_all() is failing? Thanks!


Solution

  • EDIT: Removed my prior answer, which technically worked but didn't address the actual issue.

    The problem is that your regex isn't doing what I think you intend it to do. The first alternative in your regex is ^(\\s*[[:punct:]]*\\s*)+. None of its elements is required to be present, so it can match the empty string at the start of any input, and none of your other alternatives will be checked. That is also why grep() reports a match: an empty match is still a match. When stringr::str_remove_all uses that regex, it runs through the alternatives at the first character, takes the first alternative since it matches (with nothing in it), removes nothing, and moves on. Change it to ^(\\s*[[:punct:]]+\\s*)+ and you'll have better success.

    > regexpattern <- '^(\\s*[[:punct:]]*\\s*)+'
    > grep(regexpattern,strg, perl = T, value=TRUE)
    [1] "Ú"
    > regexpattern <- '^(\\s*[[:punct:]]+\\s*)+'
    > grep(regexpattern,strg, perl = T, value=TRUE)
    character(0)
    > regexpattern <- '^(\\s*[[:punct:]]+\\s*)+|^\\s*\\d{1,2}\\s*$|^\\s+$|(^\\s*[[:ascii:]]{1}\\s*$)|(^\\s*[^[[:ascii:]]]{1}\\s*$)'
    > str_remove_all(strg, regexpattern)
    [1] ""
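
To make the empty match visible, here is a minimal sketch that assumes stringr is installed. The variable names (optional_only, punct_required, matched_text and so on) are made up for illustration and are not from the question. It compares the original first alternative with the corrected one on the same input.

```r
library(stringr)

strg <- "\u00DA"  # the "Ú" from the question, written as an escape

# First alternative from the question: every piece is optional,
# so it can succeed without consuming a single character.
optional_only <- "^(\\s*[[:punct:]]*\\s*)+"

# Same alternative, but at least one punctuation character is required.
punct_required <- "^(\\s*[[:punct:]]+\\s*)+"

matched_text <- str_extract(strg, optional_only)   # text covered by the match
match_found  <- str_detect(strg, optional_only)    # is there any match at all?
after_fix    <- str_detect(strg, punct_required)   # does the stricter version match?

print(nchar(matched_text))
print(match_found)
print(after_fix)
```

If the explanation above holds, the first pattern is detected as a match but the matched text has zero characters, so there is nothing for str_remove_all() to delete. The stricter pattern does not match "Ú" at all, which leaves the later alternatives free to handle it.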