Tags: r, string, data-manipulation, strsplit

Split a column of strings (with different patterns) based on two different conditions


I was hoping to get some help with this problem. I have a column containing two types of strings, and I need to split each string into multiple columns using two different conditions. I can figure out how to split each type individually, but I am struggling to combine the two, perhaps with an IF statement. This is the example dataset:

data = data.frame(string=c("HFUFN-087836_661", "207465-125 - IK_6 Mar 2009.docx_37484956"))

For the first type of string (with a single _), I would like to split at the _. I used the following code for that:

strsplit(data$string, "_")

For strings that have .docx in them, I would like to split after the docx. I cannot split on "_" because it appears multiple times in this string. So I used the following code:

strsplit(data$string, "x_")

My problem is that both types of strings appear in the same column. Is there a way to tell R that if "docx" is in the string it should split after x_, but if it is not, it should split on the _?

Any help would be appreciated - Thank you guys!


Solution

  • Here's a tidyr solution:

    library(tidyr)
    data %>%
    extract(string,
            into = c("1","2"),    # choose your own column labels
            "(.*?)_([^_]+)$")
                                    1        2
    1                    HFUFN-087836      661
    2 207465-125 - IK_6 Mar 2009.docx 37484956
    

    How the regex works:

    The regex partitions the strings into two "capture groups" plus an underscore in-between:

    • (.*?): first capture group, matching any character (.) zero or more times (*) non-greedily (?)
    • _: a literal underscore
    • ([^_]+)$: the second capture group, matching any character that is not an underscore ([^_]) one or more times (+) at the very end of the string ($)
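
To see the two capture groups without tidyr, the same pattern can be tried in base R. This is only a minimal sketch: the object names and the column names prefix and suffix are made up for illustration, and perl = TRUE is an assumption rather than something the answer requires.

```r
# Same two example strings as in the question
strings <- c("HFUFN-087836_661",
             "207465-125 - IK_6 Mar 2009.docx_37484956")

# Pattern from the answer: lazy prefix, underscore, underscore-free suffix
pattern <- "(.*?)_([^_]+)$"

# regexec() reports the full match followed by one entry per capture group
matches <- regmatches(strings, regexec(pattern, strings, perl = TRUE))

# Drop the full match (first element) and keep only the two groups
parts <- do.call(rbind, lapply(matches, function(m) m[-1]))
colnames(parts) <- c("prefix", "suffix")  # hypothetical column names

print(parts)
```

Because the second group cannot contain an underscore and must reach the end of the string, the split always lands on the last underscore, so no separate check for "docx" is needed.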

    Data:

    data = data.frame(string=c("HFUFN-087836_661", "207465-125 - IK_6 Mar 2009.docx_37484956"))
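
For comparison, the explicit "if docx, split there, otherwise split on the underscore" logic from the question could look roughly like the sketch below. It is not part of the answer above; the names has_docx, split_pos, part1 and part2 are hypothetical, and it assumes every string without "docx_" contains an underscore.

```r
data <- data.frame(
  string = c("HFUFN-087836_661",
             "207465-125 - IK_6 Mar 2009.docx_37484956"),
  stringsAsFactors = FALSE  # keep strings as character on R < 4.0
)
s <- data$string

# Condition: does the string contain "docx_"?
has_docx <- grepl("docx_", s, fixed = TRUE)

# Position of the underscore to split on:
#   docx rows  -> the "_" right after "docx" (4 characters past its start)
#   other rows -> the first "_" in the string
split_pos <- ifelse(has_docx,
                    regexpr("docx_", s, fixed = TRUE) + 4L,
                    regexpr("_", s, fixed = TRUE))

# Cut each string on either side of the chosen underscore
data$part1 <- substr(s, 1, split_pos - 1)
data$part2 <- substr(s, split_pos + 1, nchar(s))

print(data[, c("part1", "part2")])
```

On these two strings it gives the same pieces as the regex approach, but it needs one more branch for every new file extension, which is why splitting on the last underscore is the simpler rule.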