Search code examples
rtidyrdelimiterdata-wrangling

Separating data by delimiter in R: How can I specify at which delimiter (for example the 4th in a series of 5) that characters are separated?


I have a column of unique photo IDs in a data frame. Each photo ID is comprised of a long string of characters and separated by a "_".

I want to split these IDs into columns using the delimiter "_".

This sounds easy enough, but I would like to split the characters into two specific columns at the last delimiter. For example if one of my photo IDs were:

"TV_Banana_122_Afternoon_Pre"

And I am trying to get the resulting split to be:

"TV_Banana_122_Afternoon" and "Pre"

Is there any way to specify that I want to separate this phrase by the 4th delimiter only?


Solution

  • You could use strsplit and the following regular expression to separate the string and then do.call and rbind to create a new dataframe with each part in its own column.

    CODE TO MATCH SPECIFIC # OF UNDERSCORES

    df <- data.frame(x = c("TV_Banana_122_Afternoon_Pre"))
    
    df_new <- data.frame(do.call("rbind", strsplit(sub('(^[^_]+_[^_]+_[^_]+_[^_]+)_(.*)$', '\\1 \\2', df), ' ')))
    
    df_new
    

    OUTPUT

                           X1  X2
    1 TV_Banana_122_Afternoon Pre
    

    Per the comment by @AnilGoyal, if you needed to match an additional underscore you would just need to add an additional _[^_]+ to the first match in sub. See example below.

    CODE TO MATCH A STRING WITH AN ADDITIONAL UNDERSCORE

    df2 <- data.frame(x = c("TV_Banana_122_Afternoon_Test_Pre"))
    
    df2_new <- data.frame(do.call("rbind", strsplit(sub('(^[^_]+_[^_]+_[^_]+_[^_]+_[^_]+)_(.*)$', '\\1 \\2', df2), ' ')))
    
    df2_new
    

    OUTPUT

                                X1  X2
    1 TV_Banana_122_Afternoon_Test Pre
    

    Also, if you have strings with varying #s of underscores, but you always want to split at the last underscore, you could just match the whole string up to the last underscore per the regex below.

    CODE TO MATCH THE LAST UNDERSCORE

    df_new2 <- data.frame(do.call("rbind", strsplit(sub('(.*)_(.*)$', '\\1 \\2', df), ' ')))
    df_new2
    

    OUTPUT

                       X1  X2
    1 TV_Banana_122_Afternoon Pre