r string split uppercase lowercase

Splitting strings by case


I have a large text-based data frame (about 100k rows) where each row is a string that starts with lowercase letters, followed by uppercase letters separated by spaces, as in the example below:

df1 <- data.frame(a = c('lowercase U P P E R C A S E', 'letters N U M B E R S'), 
                  stringsAsFactors = FALSE)
df1

I am trying to split the string at the point where it becomes uppercase and move the uppercase characters into a new column (thereby removing them from the original column). The desired output would then look like this:

df2 <- data.frame(a = c('lowercase', 'letters'),
                  b = c('U P P E R C A S E', 'N U M B E R S'),
                  stringsAsFactors = FALSE)
df2

I'm truthfully not sure where to begin in doing something like this. Any ideas?


Solution

  • There are many ways to do this, but most of them rely on regular expressions.

    In base R, you could do:

    df3 <- data.frame(
             a = gsub(pattern = "^([a-z]+) (([A-Z] )*[A-Z])$", replacement = "\\1", x = df1$a),
             b = gsub(pattern = "^([a-z]+) (([A-Z] )*[A-Z])$", replacement = "\\2", x = df1$a),
             stringsAsFactors = FALSE)
    

    Here, gsub captures the lowercase letters in the first group, ([a-z]+), and the alternating capitals and spaces in the second group, (([A-Z] )*[A-Z]). Because the pattern is anchored with ^ and $, it matches the whole string, so the replacement "\\1" leaves only the contents of the first group for column a, and "\\2" leaves only the contents of the second group for column b.
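
One thing to keep in mind with the replacement approach: when a string does not match the pattern, it is returned unchanged, so the whole string would end up in both columns. Below is a minimal sketch of one possible alternative, assuming R 3.4.0 or later so that utils::strcapture is available. The third input row and the names x, pattern, a_sub and parts are made up for illustration, and the inner group is written as non-capturing so that the pattern has exactly two captures.

```r
# Sample strings from the question, plus one hypothetical row that does not fit the pattern
x <- c("lowercase U P P E R C A S E", "letters N U M B E R S", "no capitals here")

# Same idea as above, but the inner repeated group is non-capturing: (?: ... )
pattern <- "^([a-z]+) ((?:[A-Z] )*[A-Z])$"

# sub() returns non-matching input untouched, so the third element stays as it was
a_sub <- sub(pattern, "\\1", x, perl = TRUE)

# strcapture() returns one column per capture group and NA where the pattern fails
parts <- utils::strcapture(
  pattern, x,
  proto = data.frame(a = character(), b = character(), stringsAsFactors = FALSE),
  perl = TRUE
)
parts
```

With this variant, rows that do not follow the expected layout show up as NA in both columns instead of silently keeping the full string, which may make them easier to spot in 100k rows.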

    Another approach, this time using look-ahead and look-behind, and the separate function from the tidyr package:

    df4 <- tidyr::separate(df1, 
                           col = a, 
                           into = c("a", "b"), 
                           sep = "(?<=[a-z]) (?=[A-Z])")
    

    Here, (?<=[a-z]) is a look-behind that asserts the preceding character is a lowercase letter, and (?=[A-Z]) is a look-ahead that asserts the following character is an uppercase letter. Neither of them consumes any characters, so the only thing the pattern actually matches is the space between them. The string is therefore split at the space that comes directly after a lowercase letter and directly before an uppercase letter, which is exactly the space separating the two columns you are trying to create.
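
If you would rather not depend on tidyr, the same look-around pattern can probably be used with base R's strsplit, as long as perl = TRUE is set (the default regex engine does not support look-behind). This is only a sketch under the assumption that every row contains exactly one such space; the names pieces and df5 are made up for illustration.

```r
# Sample data from the question
df1 <- data.frame(a = c("lowercase U P P E R C A S E", "letters N U M B E R S"),
                  stringsAsFactors = FALSE)

# Split on the single space that sits between a lowercase and an uppercase letter
pieces <- strsplit(df1$a, "(?<=[a-z]) (?=[A-Z])", perl = TRUE)

# Take the first and second piece of every row; a missing piece becomes NA
df5 <- data.frame(a = vapply(pieces, `[`, character(1), 1),
                  b = vapply(pieces, `[`, character(1), 2),
                  stringsAsFactors = FALSE)
df5
```

For the two example rows, df5 should have the same contents as the desired df2 from the question.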