I've got a vector of (human) names, all in capitals:
names <- c("FRIEDRICH SCHILLER", "FRANK O'HARA", "HANS-CHRISTIAN ANDERSEN")
To decapitalize them (capitalize only the first letter of each word), I have so far been using
simpleDecap <- function(x) {
  s <- strsplit(x, " ")[[1]]
  paste0(substring(s, 1, 1), tolower(substring(s, 2)), collapse = " ")
}
sapply(names, simpleDecap, USE.NAMES=FALSE)
# [1] "Friedrich Schiller" "Frank O'hara" "Hans-christian Andersen"
But I also want to account for ' and -. Using
s <- strsplit(x, " |\\'|\\-")[[1]]
of course finds the right letters, but the ' and - are then lost in the collapse. Hence, I tried
simpleDecap2 <- function(x) {
  # Split on each separator in turn and re-join with that same separator
  for (char in c(" ", "-", "'")) {
    s <- strsplit(x, char, fixed = TRUE)[[1]]
    x <- paste0(substring(s, 1, 1), tolower(substring(s, 2)), collapse = char)
  }
  return(x)
}
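but that's even worse, of course, because the separators are processed one after the other, and each pass lowercases everything after the first letter of every piece, undoing the capitals set by the previous pass: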
sapply(names, simpleDecap2, USE.NAMES=FALSE)
# [1] "Friedrich schiller" "Frank o'Hara" "Hans-christian andersen"
I think the right approach is to split according to
s <- strsplit(x, " |\\'|\\-")[[1]]
but then paste is the problem, because its collapse argument accepts only a single separator.
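For what it's worth, the splitting idea can be kept if the pieces are rewritten in place instead of being pasted back together. The following is only a minimal sketch, assuming base R's gregexpr and the regmatches<- replacement function; decapInPlace is a hypothetical name used for illustration.

```r
names <- c("FRIEDRICH SCHILLER", "FRANK O'HARA", "HANS-CHRISTIAN ANDERSEN")

# decapInPlace is a hypothetical name, not part of the original post
decapInPlace <- function(x) {
  # Locate every run of characters that is not a space, apostrophe, or hyphen
  m <- gregexpr("[^ '-]+", x)
  # Rewrite only the matched pieces; the separators between them stay untouched
  regmatches(x, m) <- lapply(regmatches(x, m), function(s) {
    paste0(substring(s, 1, 1), tolower(substring(s, 2)))
  })
  x
}

decapInPlace(names)
# [1] "Friedrich Schiller" "Frank O'Hara" "Hans-Christian Andersen"
```

Because nothing is collapsed, no separator has to be chosen, and the function is vectorized over x, so no sapply is needed.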
This seems to work, using Perl compatible regular expressions:
gsub("\\b(\\w)([\\w]+)", "\\1\\L\\2", names, perl = TRUE)
\L converts the rest of the replacement, here the second match group, to lower case (until an optional \E ends the conversion).
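To check the behaviour on the vector from the question, the call can be wrapped in a small function. This is a sketch only: decap and decapAny are hypothetical names, and the second variant assumes that input may not be all upper case, so it also forces the first letter to upper case with \U.

```r
names <- c("FRIEDRICH SCHILLER", "FRANK O'HARA", "HANS-CHRISTIAN ANDERSEN")

# decap is a hypothetical wrapper around the gsub call shown above
decap <- function(x) {
  # \\b(\\w) captures the first letter of each word, (\\w+) the rest;
  # \\L lowercases everything in the replacement that follows it
  gsub("\\b(\\w)(\\w+)", "\\1\\L\\2", x, perl = TRUE)
}

decap(names)
# [1] "Friedrich Schiller" "Frank O'Hara" "Hans-Christian Andersen"

# decapAny is a hypothetical variant for mixed-case input:
# \\U uppercases the first letter, then \\L lowercases the remainder
decapAny <- function(x) {
  gsub("\\b(\\w)(\\w*)", "\\U\\1\\L\\2", x, perl = TRUE)
}

decapAny("friedrich SCHILLER")
# [1] "Friedrich Schiller"
```

The apostrophe and hyphen are not word characters, so \b treats them as word boundaries and they are never touched by the replacement.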