Decapitalize human names (accounting for ' and -)


I've got a vector of (human) names, all in capitals:

names <- c("FRIEDRICH SCHILLER", "FRANK O'HARA", "HANS-CHRISTIAN ANDERSEN")

To decapitalize them (keep only the first letter of each word capitalized), I have so far been using

simpleDecap <- function(x) {
  s <- strsplit(x, " ")[[1]] 
  paste0(substring(s, 1,1), tolower(substring(s, 2)), collapse=" ")
  }
sapply(names, simpleDecap, USE.NAMES=FALSE)
# [1] "Friedrich Schiller"         "Frank O'hara"         "Hans-christian Andersen"

But I also want to account for ' and -. Using s <- strsplit(x, " |\\'|\\-")[[1]] of course finds the right letters, but then ' and - get lost in the collapse. Hence, I tried

simpleDecap2 <- function(x) {
  # Split on each separator in turn and re-join with the same separator
  for (char in c(" ", "\\-", "\\'")) {
    s <- strsplit(x, char)[[1]]
    x <- paste0(substring(s, 1, 1), tolower(substring(s, 2)), collapse = char)
  }
  return(x)
}

but that's even worse, of course, as the splits are applied one after the other, so each pass lower-cases what the previous pass had capitalized:

sapply(names, simpleDecap2, USE.NAMES=FALSE)
# [1] "Friedrich schiller"      "Frank o'Hara"            "Hans-christian andersen"

I think the right approach is to split with s <- strsplit(x, " |\\'|\\-")[[1]], but the paste0() step is the problem, because collapse= can only re-insert a single separator.


Solution

  • This seems to work, using Perl-compatible regular expressions:

    gsub("\\b(\\w)([\\w]+)", "\\1\\L\\2", names, perl = TRUE)
    

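    \b(\w) captures the first letter of each word and ([\w]+) the remaining letters. In the replacement, \1 puts the first letter back unchanged and \L converts the match group that follows it (\2) to lower case. \L is only available with perl = TRUE; \U is its upper-case counterpart and \E ends the case conversion.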
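
For reuse, the one-liner could be wrapped in a small function. The sketch below is only an illustration: the names decap_names, decap_names_any_case and full_names are made up here, and the second function is a variant that is not part of the answer. It lower-cases the whole string first and then upper-cases each word-initial letter, on the assumption that the input may not be all capitals.

```r
# Hypothetical helper names; only the gsub() call in decap_names()
# comes from the answer above.
decap_names <- function(x) {
  # \\b(\\w) captures the first letter of each word, ([\\w]+) the rest;
  # \\L lower-cases the second group in the replacement (needs perl = TRUE).
  gsub("\\b(\\w)([\\w]+)", "\\1\\L\\2", x, perl = TRUE)
}

# Variant (an assumption, not from the answer): lower-case everything first,
# then upper-case the first letter of each word with \\U.
decap_names_any_case <- function(x) {
  gsub("\\b(\\w)", "\\U\\1", tolower(x), perl = TRUE)
}

full_names <- c("FRIEDRICH SCHILLER", "FRANK O'HARA", "HANS-CHRISTIAN ANDERSEN")
decap_names(full_names)
# [1] "Friedrich Schiller"      "Frank O'Hara"            "Hans-Christian Andersen"
decap_names_any_case("frank o'HARA")
# [1] "Frank O'Hara"
```

Note that decap_names() leaves the first letter of each word as it is, so it relies on the input already being upper case, as in the question. Both versions use \w, which by default may not cover accented letters, so names such as "JÜRGEN" would need checking separately.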