
How can I replace '\U' using regular expressions?


The question is pretty simple. I'm trying to replace "\U" throughout a vector of strings, and for this I'm using the package {stringr}, but I'm having issues matching the pattern.

text <- "\U0001f517"

stringr::str_detect(text, "\U")
#> Error: '\U' used without hex digits in character string starting ""\U"

stringr::str_detect(text, "\\U")
#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) : 
#>   Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE, context=`\U`)

stringr::str_detect(text, "\\\U")
#> Error: '\U' used without hex digits in character string starting ""\\\U"

stringr::str_detect(text, "\\\\U")
#> FALSE

stringr::str_detect(text, "\\\\\U")
#> Error: '\U' used without hex digits in character string starting ""\\\\\U"

stringr::str_detect(text, "\\\\\\U")
#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) : 
#>   Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE, context=`\\\U`)

stringr::str_detect(text, "\\\\\\\U")
#> Error: '\U' used without hex digits in character string starting ""\\\\\\\U"

# ... you get the idea

As far as I can tell, this issue is because the regex engine sees "\U" as indicating the beginning of a new hex code, as indicated by the first error. Other characters work fine:

text <- "\a0001f517"

stringr::str_detect(text, "\a")
#> TRUE

I've seen other questions around this issue, e.g. here, but still can't get this to work. Can anyone give me a working regex for this?


Solution

  • The \U in your text <- "\U0001f517" is not a separate character sequence; it is part of the notation for a Unicode code point. The text variable actually contains the single character 🔗, which you can easily check using cat(text).

    By contrast, "\a" is a complete escape on its own: it denotes a single character (the "Bell" character) that can also be written as "\u0007" or "\x07" (run "\a" == '\x07' and you will see that the output is TRUE). See more about string escape sequences syntax.

    In R, to get the escape notation itself as literal text (a backslash, a "U", and the hex digits), you can use

    text <- "\U0001f517"
    cat(text)
    ## => 🔗 
    
    library("utf8")
    text <- utf8_encode(text)
    cat(text)
    ## => \U0001f517
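
The points above can be checked with a short base R sketch. It is an illustration only, not part of the original answer: the variable names are made up for the example, and the sprintf() call is a hypothetical manual way to build the escaped form, assuming the string holds a single code point.

```r
# The string holds one code point, not a backslash followed by "U".
text <- "\U0001f517"

n_chars <- nchar(text)          # number of characters in the string
code_point <- utf8ToInt(text)   # integer value of the single code point

# There is no literal backslash in the string, so "\U" can never match.
has_backslash <- grepl("\\", text, fixed = TRUE)

# "\a" is likewise one character, identical to "\x07".
bell_same <- identical("\a", "\x07")

# Matching and replacing the character itself works as usual.
cleaned <- gsub("\U0001f517", "", paste0("link ", text), fixed = TRUE)

# Hypothetical manual escape: build the text form, which does contain
# a real backslash followed by "U" and eight hex digits.
escaped <- sprintf("\\U%08x", code_point)

# The regex "\\\\U" (a literal backslash, then "U") matches only the
# escaped form, never the original one-character string.
matches_escaped <- grepl("\\\\U", escaped)
matches_original <- grepl("\\\\U", text)
```

This is why "\\\\U" returned FALSE in the question: the pattern is a valid regex for a literal backslash followed by "U", but the original string contains no backslash for it to find.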