regex, r, string, split, strsplit

String split on a number word pattern


I have a data frame that looks like this:

V1                        V2
peanut butter sandwich    2 slices of bread 1 tablespoon peanut butter

What I'm aiming to get is:

V1                        V2
peanut butter sandwich    2 slices of bread
peanut butter sandwich    1 tablespoon peanut butter

I've tried splitting the string with strsplit(df$V2, " "), but that splits on every space. Is there a way to split only where a number starts, so that each piece runs from one number up to the next?


Solution

  • You can split the string as follows:

    txt <- "2 slices of bread 1 tablespoon peanut butter"
    
    strsplit(txt, " (?=\\d)", perl=TRUE)[[1]]
    #[1] "2 slices of bread"          "1 tablespoon peanut butter"
    

    The regex matches a space that is followed by a digit. The zero-width positive lookahead (?=\\d) checks that the next character is a digit without consuming it, so only the space itself acts as the separator. Why the lookahead? Because we don't want the digit to be part of the splitting pattern (it would be dropped from the result); we just want to match any space that is followed by a digit.
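
To make the role of the lookahead concrete, here is a small sketch that compares it with a pattern that consumes the digit. The variable names consumed and kept are only illustrative and are not part of the original answer.

```r
txt <- "2 slices of bread 1 tablespoon peanut butter"

# Without a lookahead, the digit belongs to the separator and is thrown away
consumed <- strsplit(txt, " \\d", perl = TRUE)[[1]]
print(consumed)
# [1] "2 slices of bread"         " tablespoon peanut butter"

# With the lookahead, only the space is removed and the digit stays in place
kept <- strsplit(txt, " (?=\\d)", perl = TRUE)[[1]]
print(kept)
# [1] "2 slices of bread"          "1 tablespoon peanut butter"
```

In the first result the "1" is gone and the second piece starts with a stray space, which is exactly what the lookahead avoids.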

    To apply the same idea to a whole data frame, see this example (it uses columns named item and txt in place of V1 and V2):

    item <- c("peanut butter sandwich", "onion carrot mix", "hash browns")
    txt <- c("2 slices of bread 1 tablespoon peanut butter", "1 onion 3 carrots", "potato")
    df <- data.frame(item, txt, stringsAsFactors=FALSE)
    
    # thanks to Ananda for recommending setNames
    split.strings <- setNames(strsplit(df$txt, " (?=\\d)", perl=TRUE), df$item) 
    # alternatively:
    #split.strings <- strsplit(df$txt, " (?=\\d)", perl=TRUE)
    #names(split.strings) <- df$item
    
    stack(split.strings)
    #                      values                    ind
    #1          2 slices of bread peanut butter sandwich
    #2 1 tablespoon peanut butter peanut butter sandwich
    #3                    1 onion       onion carrot mix
    #4                  3 carrots       onion carrot mix
    #5                     potato            hash browns
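
If you would rather keep the V1/V2 column names and order from the question, one possible variation (a sketch, not part of the original answer) is to repeat each V1 value once per piece instead of calling stack. It assumes R 3.2.0 or later for lengths(); the names parts and long are made up for illustration.

```r
df <- data.frame(
  V1 = c("peanut butter sandwich", "onion carrot mix", "hash browns"),
  V2 = c("2 slices of bread 1 tablespoon peanut butter", "1 onion 3 carrots", "potato"),
  stringsAsFactors = FALSE
)

# One character vector of pieces per row of df
parts <- strsplit(df$V2, " (?=\\d)", perl = TRUE)

# Repeat each V1 value once for every piece its V2 string was split into
long <- data.frame(
  V1 = rep(df$V1, lengths(parts)),
  V2 = unlist(parts),
  stringsAsFactors = FALSE
)
print(long)
```

This yields the same five rows as the stack() output above, with V1 as the first column and plain character columns rather than a factor.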