
Removing non-breaking space characters in R


I have a data frame, df1, with several columns and more than 50,000 observations. One of the variables is PLATES (denoted here as "y"), which contains the plate numbers of buses in a city. I want to match this data frame against another one (df2) that also has plate data, keeping only the matching records. While looking at the data in df1, which comes from a CSV file, I realized that for y, several observations have bytes before the plate number that correspond to a non-breaking space. How do I get rid of them so that they don't interfere with the matching? Here's some code to illustrate. Say you have 5 plate numbers:

y <- c("0740170", "0740111", "0740119", "0740115", "0740048")

But upon further inspection with

View(y)

you see the following:

<c2><a0>0740170
<c2><a0>0740111
<c2><a0>0740119
<c2><a0>0740115
<c2><a0>0740048
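
For reference, here is one way to confirm what those leading bytes are. This is a minimal sketch using a made-up value, not the actual CSV data: U+00A0, the non-breaking space, is encoded in UTF-8 as the two bytes c2 a0, and charToRaw() shows them directly. The names plate and has_nbsp are only illustrative.

```r
# Hypothetical value: a plate number prefixed with U+00A0 (non-breaking space).
plate <- "\u00A00740170"

# Raw bytes of the string: the first two are c2 a0 (U+00A0 in UTF-8),
# followed by the ASCII codes of the digits.
plate_bytes <- charToRaw(plate)
print(plate_bytes)

# Literal (non-regex) check for a non-breaking space in the value.
has_nbsp <- grepl("\u00A0", plate, fixed = TRUE)
print(has_nbsp)
```

The same grepl() call applied to a whole column gives a logical vector marking the affected rows.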

I tried the following, taken from this post: https://blog.tonytsai.name/blog/2017-12-04-detecting-non-breaking-space-in-r/, but it didn't work:

y <- gsub("\u00A0", " ", y, fixed = TRUE)

I would greatly appreciate your help with this issue. Thanks!


Solution

  • I'm not sure this will help, since I can't reproduce your problem and so can't test my answer. But a non-breaking space is a non-ASCII character, so one option is to strip every non-ASCII character:

    y <- gsub("[^ -~]+", "", y)
    

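    The pattern matches any run of characters outside the printable ASCII range (space through tilde), and the replacement is the empty string, so those characters are deleted. Note that this also removes any other non-ASCII characters, such as accented letters. Hope this helps.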
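
As a rough end-to-end illustration of the approach above, the sketch below builds made-up plate strings with a leading non-breaking space, cleans them in two ways, and then keeps only the matching records. The data frames, the route column, and the plate "0999999" are hypothetical stand-ins for df1 and df2. The trimws() variant is an alternative not mentioned in the answer: it only trims leading and trailing whitespace (including Unicode whitespace), and its whitespace argument assumes R 3.6.0 or later.

```r
# Hypothetical sample data: plate numbers kept as strings (so leading zeros
# survive), each prefixed with a non-breaking space (U+00A0).
plates <- c("0740170", "0740111", "0740119", "0740115", "0740048")
y <- paste0("\u00A0", plates)

# Option 1: delete everything outside printable ASCII (pattern from the answer).
y_ascii <- gsub("[^ -~]+", "", y)

# Option 2: trim only leading/trailing horizontal and vertical whitespace,
# which covers U+00A0 (requires the `whitespace` argument, R >= 3.6.0).
y_trim <- trimws(y, whitespace = "[\\h\\v]")

# Hypothetical data frames standing in for df1 and df2.
df1 <- data.frame(PLATES = y_trim, route = c("A", "B", "C", "D", "E"))
df2 <- data.frame(PLATES = c("0740111", "0740048", "0999999"))

# Inner join: keeps only rows whose plate appears in both data frames.
matched <- merge(df1, df2, by = "PLATES")
print(matched)
```

Option 2 is the gentler choice if the column might legitimately contain non-ASCII characters, since it leaves everything except the surrounding whitespace untouched.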