Replacing characters in R string based on raw hex values


Suppose I have a string in R,

mystring = 'help me'

but with a twist: the space between 'help' and 'me' is actually a non-breaking space. R stores the non-breaking space as the bytes <c2 a0> (its UTF-8 encoding), so this string can be created by

mystring = rawToChar(as.raw(as.hexmode(c('68','65','6c','70','c2','a0','6d','65'))))

Then, for example, grepl('help me', mystring) returns FALSE.

How can I replace the non-breaking space with a regular space? And, in general, how can I replace any particular raw value(s) with a particular character? Ideally I would be able to write a function like

gsubRaw(mystring, as.raw(as.hexmode(c('c2','a0'))), ' ')

This answer almost answers my question, except that I don't want to replace ALL non-ASCII characters with a space, only the non-breaking space.

grepRaw() also came close, because it can locate the raw bytes in the string so that they can then be replaced. However, it didn't work cleanly: sometimes the position that grepRaw() returned didn't match the position of the non-breaking space in the string as plain text, and I don't know how to replace the raw values themselves.
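
One likely cause of the mismatch described above, sketched here as an assumption rather than a diagnosis: grepRaw() works on bytes and returns byte offsets, while regexpr() counts characters. Any multi-byte character that comes before the non-breaking space shifts the two apart. The string `s` below is a hypothetical example, not taken from the question.

```r
# Hypothetical example: 'é' (2 bytes in UTF-8) comes before the non-breaking space.
# intToUtf8() always returns a UTF-8 string, so this does not depend on the locale.
s    <- intToUtf8(c(233, 32, 104, 101, 108, 112, 160, 109, 101))  # "é help<nbsp>me"
nbsp <- intToUtf8(160)                                            # U+00A0

# Offset of the non-breaking space counted in bytes
byte_pos <- grepRaw(charToRaw(nbsp), charToRaw(s), fixed = TRUE)

# Offset of the non-breaking space counted in characters
char_pos <- as.integer(regexpr(nbsp, s, fixed = TRUE))

byte_pos  # 8
char_pos  # 7
```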


Solution

  • Following the comments on my answer to the other question, this can be done by using the fact that the non-breaking space is the byte sequence \xc2\xa0 (at least in R 4.3.1 on Windows):

    mystring = rawToChar(as.raw(as.hexmode(c('68','65','6c','70','c2','a0','6d','65'))))
    grepl('help me', mystring)
    #> [1] FALSE
    tools::showNonASCII(mystring)
    #> 1: help<c2><a0>me
    
    identical('help\xc2\xa0me', mystring)
    #> [1] TRUE
    
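    # In a UTF-8 session the trailing + also collapses a run of consecutive non-breaking spaces into one space
    mynewstring = gsub('\xc2\xa0+', ' ', mystring)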
    grepl('help me', mynewstring)
    #> [1] TRUE
    tools::showNonASCII(mynewstring)
    

    Created on 2023-07-05 with reprex v2.0.2
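
The same idea can be generalised into the gsubRaw() helper that the question asks for. The following is only a sketch: gsubRaw() is a hypothetical name, not a base R function, and it assumes that matching literally (fixed = TRUE) and byte by byte (useBytes = TRUE) is acceptable. Unlike the pattern above, it replaces each occurrence separately rather than collapsing a run of them.

```r
# gsubRaw() is a hypothetical helper, not part of base R.
# It replaces every occurrence of a byte sequence in x with a replacement string.
gsubRaw <- function(x, raw_pattern, replacement) {
  stopifnot(is.character(x), is.raw(raw_pattern), is.character(replacement))
  pattern <- rawToChar(raw_pattern)
  # fixed = TRUE:    treat the pattern literally, not as a regular expression
  # useBytes = TRUE: compare byte by byte, independent of the session encoding
  out <- gsub(pattern, replacement, x, fixed = TRUE, useBytes = TRUE)
  # Depending on the R version, byte-wise matching may change the encoding mark,
  # so restore the marks of the input
  Encoding(out) <- Encoding(x)
  out
}

mystring    <- rawToChar(as.raw(as.hexmode(c('68','65','6c','70','c2','a0','6d','65'))))
mynewstring <- gsubRaw(mystring, as.raw(as.hexmode(c('c2','a0'))), ' ')
grepl('help me', mynewstring)
```

Because the match is made on bytes, a pattern that covers only part of a multi-byte character could leave an invalid string behind, so it seems safest to pass whole characters such as c2 a0.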