r, regex, string, data-manipulation

Removing Everything to the Right of a REGEX Expression


I have data that looks like this:

my_data = c("red A1B 5L2  101", "blue A1C 5L8  10872", "Green A1D 5L5  100003" )

Starting from the right-hand side of each string, I want to remove the trailing number as well as the spaces before it.

The final result would look something like this:

[1] "red A1B 5L2"   "blue A1C 5L8"  "Green A1D 5L5"

I know that each string contains a segment matching a regex pattern, which I have as the following pattern and replacement pair: '(([A-Z] ?[0-9]){3})|.', '\\1'

Thus, I want to identify the position where the match for this pattern ends and the position where the string ends. I could then delete everything between these two positions and obtain the desired result.

I found this link which shows how to remove all characters in a string appearing to the left or to the right of a certain pattern (https://datascience.stackexchange.com/questions/8922/removing-strings-after-a-certain-character-in-a-given-text). I tried to apply the logic provided here to my example:

gsub("(([A-Z] ?[0-9]){3})|.', '\\1.*","",my_data)

But this is producing the opposite result!

[1] "red   101"      "blue   10872"   "Green   100003"

Can someone please show me how to resolve this problem?


Solution

  • We can use sub() here:

    my_data <- c("red A1B 5L2  101", "blue A1C 5L8  10872", "Green A1D 5L5  100003" )
    output <- sub("\\s+\\d+$", "", my_data)
    output
    
    [1] "red A1B 5L2"   "blue A1C 5L8"  "Green A1D 5L5"
    

    The regex pattern used here is \s+\d+$ (written as "\\s+\\d+$" in an R string). It matches one or more whitespace characters followed by one or more digits at the end of the string, and sub() replaces that match with an empty string.
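
If you would rather follow the idea from the question (find where the code pattern ends and drop everything after it), here is a minimal sketch. It assumes every string contains a segment matching ([A-Z] ?[0-9]){3}, as in the sample data; the names code_pattern, by_capture and by_position are only illustrative.

```r
my_data <- c("red A1B 5L2  101", "blue A1C 5L8  10872", "Green A1D 5L5  100003")

# Pattern taken from the question: three letter/digit pairs,
# each pair optionally separated by a single space
code_pattern <- "([A-Z] ?[0-9]){3}"

# Option 1: capture everything up to the end of the first code match,
# then replace the whole string with that captured part
by_capture <- sub(paste0("^(.*?", code_pattern, ").*$"), "\\1", my_data, perl = TRUE)

# Option 2: locate the match, compute where it ends, and cut the string there
m <- regexpr(code_pattern, my_data, perl = TRUE)
end_pos <- as.integer(m) + attr(m, "match.length") - 1L
by_position <- substr(my_data, 1, end_pos)

by_capture
# [1] "red A1B 5L2"   "blue A1C 5L8"  "Green A1D 5L5"
by_position
# [1] "red A1B 5L2"   "blue A1C 5L8"  "Green A1D 5L5"
```

Both options depend on the code pattern being present: for a string without a match, regexpr() returns -1 and Option 2 yields an empty string, while Option 1 leaves the string unchanged. The simpler sub("\\s+\\d+$", "", my_data) above does not rely on the code pattern at all.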