
Convert part of string to upper (or lower) case


I have a vector with sample locations, here's a sample:

test <- c("Aa, Heeswijk T1", "Aa, Heeswijk t1", 
          "Aa, Middelrode t2", "Aa, Middelrode p1",
          "Aa, Heeswijk t1a", "Aa, Heeswijk t3b",
          "Aa, test1 T1", "Aa, test2 t1")

Each string consists of a location name ("Aa, Heeswijk"), a route code ("T1", "p2", "t3") and sometimes a subroute ("a" or "b"). Unfortunately, the route codes (t1, t2, p1, t1a) are sometimes in upper and sometimes in lower case. I want all the route codes in UPPER case, leaving the name and subroute unchanged. My expected outcome is:

"Aa, Heeswijk T1", "Aa, Heeswijk T1", 
"Aa, Middelrode T2", "Aa, Middelrode P1",
"Aa, Heeswijk T1a", "Aa, Heeswijk T3b",
"Aa, test1 T1", "Aa, test2 T1"

I have looked at toupper(), but that changes the whole string. I could also use gsub:

gsub("t1","T1", test)
gsub("t2","T2", test)
#etc.

But there must be a better R-ish way?!
Note: Route codes are always two characters long (one letter followed by one digit) and are preceded by a space. So the letter to convert to upper case is always the second or third character from the end.


Solution

  • We can use a regex lookaround. The pattern matches a lowercase letter at a word boundary (\\b) and captures it as a group (using parentheses), but only when a lookahead ((?=[0-9])) confirms that a digit follows. In the replacement, \\U followed by the capture group converts it to upper case. Both the lookahead and \\U require perl=TRUE.

     sub('\\b([a-z])(?=[0-9])', '\\U\\1', test, perl=TRUE)
     #[1] "Aa, Heeswijk T1"       "Aa, Heeswijk T1"       "Aa, Middelrode T2"    
     #[4] "Meander Assendelft P1" "Aa, Heeswijk T1a"      "Aa, Heeswijk T3b"    
    

    Or, without lookarounds, we can do this with two capture groups. \\U has no effect on the digit in the second group.

     sub('\\b([a-z])([0-9])', '\\U\\1\\2', test, perl=TRUE)
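
The \\b in both patterns matters for names such as "test1", where a lowercase letter also sits directly before a digit. A minimal sketch of the difference, using a hypothetical two-element vector `x` taken from the question's data:

```r
# Two strings from the question's sample; `x` is a name chosen for this sketch.
x <- c("Aa, test1 T1", "Aa, test2 t1")

# With \\b, the letter must start a word, so the "t" that ends "test" is skipped.
with_boundary <- sub('\\b([a-z])(?=[0-9])', '\\U\\1', x, perl = TRUE)

# Without \\b, sub() replaces the first letter-before-digit it finds,
# which here lies inside the location name.
without_boundary <- sub('([a-z])(?=[0-9])', '\\U\\1', x, perl = TRUE)

print(with_boundary)
print(without_boundary)
```

Without the boundary, the location name is altered ("tesT1") and the lowercase route code in the second string is left untouched, because sub() only replaces the first match.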
    

    Update

    Testing with the updated 'test' from the OP's post:

    sub('\\b([a-z])(?=[0-9])', '\\U\\1', test, perl=TRUE)
    #[1] "Aa, Heeswijk T1"   "Aa, Heeswijk T1"   "Aa, Middelrode T2"
    #[4] "Aa, Middelrode P1" "Aa, Heeswijk T1a"  "Aa, Heeswijk T3b" 
    #[7] "Aa, test1 T1"      "Aa, test2 T1"
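
The question's note (the route code is preceded by a space and sits at the end of the string, optionally followed by a subroute letter) suggests a possible alternative that anchors the match to the end of the string instead of relying on a word boundary. This is a sketch under that assumption, not part of the original answer. The variable name `fixed` is chosen for illustration, and \\E ends the case conversion so the subroute letter stays in lower case.

```r
test <- c("Aa, Heeswijk T1", "Aa, Heeswijk t1",
          "Aa, Middelrode t2", "Aa, Middelrode p1",
          "Aa, Heeswijk t1a", "Aa, Heeswijk t3b",
          "Aa, test1 T1", "Aa, test2 t1")

# Match: space, lowercase letter, digit, optional subroute letter, end of string.
# \\U upper-cases group 1 only; \\E stops the conversion before group 2.
fixed <- sub(' ([a-z])([0-9][a-z]?)$', ' \\U\\1\\E\\2', test, perl = TRUE)

print(fixed)
```

Because the pattern is tied to the end of the string, digits inside the location name (as in "test1") cannot match. The trade-off is that it only works as long as the route code really is the last token.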