r regex data-cleaning

R: separate words from numbers in a string


I need to clean up some data strings that have words and numbers or just numbers.

Below is a toy sample:

library(tidyverse)

c("555","Word 123", "two words 123", "three words here 123") %>%  
sub("(\\w+) (\\d*)",  "\\1|\\2", .)

The result is this:

[1] "555"                  "Word|123"             "two|words 123"        "three|words here 123"

But I want to place the '|' before the last set of numbers, as shown below:

[1] "|555"                  "Word|123"             "two words|123"        "three words here|123"

Solution

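  • We can use sub to match zero or more whitespace characters (\\s*) followed by a digit, which we capture as a group ((\\d)). In the replacement, use | followed by the backreference (\\1) to the captured group. Because sub replaces only the first match, the | lands before the first digit in each string, which in this data is the start of the trailing number.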

    sub("\\s*(\\d)", "|\\1", v1)
    #[1] "|555"                 "Word|123"            
    #[3] "two words|123"        "three words here|123"
    

    data

    v1 <- c("555","Word 123", "two words 123", "three words here 123")
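
    If a string can contain digits before the final number, the pattern above would split at the first digit instead. A minimal sketch, assuming the number to split off always sits at the very end of the string, anchors the match with $ and captures the whole run of digits. The two extra strings in v2 are made up for illustration and are not from the question.

```r
# Original sample data plus two hypothetical strings with digits mid-string
v2 <- c("555", "Word 123", "two words 123", "three words here 123",
        "route 66 exit 12", "a1 b2 345")

# \\s*    : optional whitespace before the number
# (\\d+)$ : capture the run of digits that ends the string
res <- sub("\\s*(\\d+)$", "|\\1", v2)
print(res)
```

    With this pattern the hypothetical "route 66 exit 12" should become "route 66 exit|12", and a string with no trailing digits is returned unchanged because sub finds nothing to replace.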