r regex strsplit

strsplit on the first comma before a colon


I've been racking my brains for a while, but I haven't come up with a good solution yet.

I have the following vector that I aim to split in R:

x <- c("Sara: has brown hair, Mary Jane: is, mostly, regarded as intelligent, Marc-Oliver: big,handsome","Elvis: loud, dead, Ray Charles: silent, dead, Rihanna: alive")

For context: the words after a colon further specify the word(s) before it. So Sara, Mary Jane, Marc-Oliver, Elvis, Ray Charles and Rihanna can be seen as categories, and each description as a subcategory. The goal is to split each string so that every category stays together with its subcategory. The result should look like this:

[1] "Sara: has brown hair" "Mary Jane: is, mostly, regarded as intelligent" "Marc-Oliver: big,handsome"
[2] "Elvis: loud, dead" "Ray Charles: silent, dead" "Rihanna: alive"

The problem is that the number of words before a colon varies from pair to pair, and so does the number of commas after a colon. Does anyone have an idea how to achieve this?

I tried to adapt the solutions from two threads (Regex expressions to match text between first comma and the comma before the first number, and Split on first comma in string), but honestly, once regular expressions get more complicated, all I see is a jumble of characters.


Solution

  • Use strsplit to split on a comma and a space that are followed by an upper-case letter. The lookahead (?=...) checks for the upper-case letter without consuming it, so the letter stays in the next piece.

    strsplit(x, ", (?=[[:upper:]])", perl = TRUE)
    

    giving

    [[1]]
    [1] "Sara: has brown hair"                          
    [2] "Mary Jane: is, mostly, regarded as intelligent"
    [3] "Marc-Oliver: big,handsome"                     
    
    [[2]]
    [1] "Elvis: loud, dead"         "Ray Charles: silent, dead"
    [3] "Rihanna: alive"
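
The pattern above works as long as every category starts with an upper-case letter and no description contains a comma and space followed by a capitalised word. As a sketch of an alternative that follows the question title more literally, one could split on a comma only when the text after it reaches a colon before reaching another comma. This pattern is not part of the original answer, and it assumes that category names never contain a comma or a colon.

```r
x <- c("Sara: has brown hair, Mary Jane: is, mostly, regarded as intelligent, Marc-Oliver: big,handsome",
       "Elvis: loud, dead, Ray Charles: silent, dead, Rihanna: alive")

# Split on a comma plus optional whitespace, but only if the following text
# runs up to a colon without another comma in between, i.e. a new
# "category:" starts right after this comma. The lookahead consumes nothing.
res <- strsplit(x, ",\\s*(?=[^,:]*:)", perl = TRUE)
res
```

On the example vector this should give the same pieces as the upper-case approach, and it would also keep a description such as "loud, Dead" in one piece.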