
Formatting Character strings (First and Last Names) in a long character vector in R


I have many names of people in my character vector:

    MLB$Names[1:4]
    [1] "Derek Jeter" "Robinson Cano" "Nick Markakis" "David Ortiz"

I would like to format them as the first initial followed by a period, then a space, then the last name. I want the result to look like the following:

    MLB$NamesFormatted[1:4]
    [1] "D. Jeter" "R. Cano" "N. Markakis" "D. Ortiz"

I'm assuming the best way to attack this would be by using grep or sub, but I can't for the life of me figure it out. I'm still a rookie at using R, but I'm loving all of its capabilities!

Any help would be greatly appreciated! Thank you!


Solution

  • We can use sub. The pattern captures the first character as a group (^(.)), skips the one or more non-whitespace characters that follow it (\\S+), and then captures a second group made of one or more whitespace characters followed by the rest of the string ((\\s+.*)) up to the end ($). The replacement is the first backreference (\\1), followed by a ., followed by the second backreference (\\2).

    sub("^(.)\\S+(\\s+.*)$", "\\1.\\2", MLB$Names)
    #[1] "D. Jeter"    "R. Cano"     "N. Markakis" "D. Ortiz"  
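
    To see what each capture group holds, the same pattern can be passed to regexec and regmatches from base R. This is only an illustrative sketch; the vector x is a hypothetical stand-in for MLB$Names.

    ```r
    # x is a hypothetical stand-in for MLB$Names
    x <- c("Derek Jeter", "Robinson Cano")

    # regexec returns the full match followed by one entry per capture group
    groups <- regmatches(x, regexec("^(.)\\S+(\\s+.*)$", x))

    # For the first name: full match, group 1 ("D"), group 2 (" Jeter")
    groups[[1]]

    # The replacement "\\1.\\2" joins group 1, a period, and group 2
    formatted <- sub("^(.)\\S+(\\s+.*)$", "\\1.\\2", x)
    formatted
    # [1] "D. Jeter" "R. Cano"
    ```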
    

    Or, more compactly, match the first run of one or more lower-case letters ([a-z]+) and replace it with a period (.). Because sub replaces only the first match, only the first name is shortened.

    sub("[a-z]+", ".", MLB$Names)
    #[1] "D. Jeter"    "R. Cano"     "N. Markakis" "D. Ortiz"  
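
    The compact pattern relies on every first name being a single capital letter followed by lower-case letters. A quick check on a couple of made-up edge cases (hypothetical inputs, not part of the original data) suggests where the two sub calls diverge:

    ```r
    # Hypothetical edge cases, not taken from the MLB data
    edge <- c("CC Sabathia", "Ichiro")

    # Anchored pattern: needs a first word, whitespace, and a remainder
    anchored <- sub("^(.)\\S+(\\s+.*)$", "\\1.\\2", edge)

    # Compact pattern: replaces the first run of lower-case letters anywhere
    compact <- sub("[a-z]+", ".", edge)

    anchored
    # [1] "C. Sabathia" "Ichiro"
    compact
    # [1] "CC S." "I."
    ```

    If names like these can occur in the data, the anchored version is probably the safer choice, since it leaves strings it does not understand unchanged.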
    

    Here is another option with strsplit: split each name on one or more lower-case letters followed by one or more whitespace characters ([a-z]+\\s+), then loop over the resulting list with vapply and paste the pieces of each name together with ". " as the separator.

    vapply(strsplit(MLB$Names, "[a-z]+\\s+"), paste, collapse=". ", character(1))
    #[1] "D. Jeter"    "R. Cano"     "N. Markakis" "D. Ortiz"   
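
    Looking at the intermediate list may make the strsplit approach easier to follow. In this sketch, x is a hypothetical stand-in for MLB$Names, and the three-word name is a made-up input added to show that every word except the last is reduced to an initial:

    ```r
    # x is a hypothetical stand-in for MLB$Names
    x <- c("Derek Jeter", "Ken Griffey Jr")

    # Each "lower-case letters + whitespace" run is consumed as a separator
    pieces <- strsplit(x, "[a-z]+\\s+")
    pieces[[1]]
    # [1] "D"     "Jeter"
    pieces[[2]]
    # [1] "K"  "G"  "Jr"

    # Glue the pieces of each name back together with ". "
    joined <- vapply(pieces, paste, character(1), collapse = ". ")
    joined
    # [1] "D. Jeter" "K. G. Jr"
    ```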
    

    Data

    MLB <- data.frame(Names = c("Derek Jeter", "Robinson Cano", 
                  "Nick Markakis", "David Ortiz"), stringsAsFactors=FALSE)
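
    To get the NamesFormatted column asked for in the question, the result of any of the calls above can be assigned back to the data frame. A minimal sketch using the first sub call:

    ```r
    MLB <- data.frame(Names = c("Derek Jeter", "Robinson Cano",
                                "Nick Markakis", "David Ortiz"),
                      stringsAsFactors = FALSE)

    # Store the formatted names in a new column alongside the originals
    MLB$NamesFormatted <- sub("^(.)\\S+(\\s+.*)$", "\\1.\\2", MLB$Names)

    MLB$NamesFormatted[1:4]
    # [1] "D. Jeter"    "R. Cano"     "N. Markakis" "D. Ortiz"
    ```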