Search code examples
rdatabasestringdataframe

Cell value breakdown and string split with uneven length


I need to breakdown the string values contained in a column. Some cells do not need any split at all. Others might required one, two, or more.

I would also like the split values to be stored in concurrent columns.

For instance, if my initial dataframe looks like this

df2 <- data.frame(district= 1000:1003,
                  party= c("PartyGreen", "Coalition1(PartyRed-PartyBlue)", "PartyRed", "Coal.(PartyBlue-PartyOrange-VelvetParty)"))

I would like to look like this:

df.neat.2 <- data.frame(district= 1000:1003,
                  party= c("PartyGreen", "Coalition1(PartyRed-PartyBlue)", "PartyRed", "Coal.(PartyBlue-PartyOrange-VelvetParty)"),
                  party1= c("PartyGreen", "PartyRed", "PartyRed", "PartyBlue"),
                  party2= c(NA, "PartyBlue", NA, "PartyOrange"),
                  party3= c(NA, NA, NA, "VelvetParty"))

Note how some cells contain a single value that does not need to be split. Also note that the string split takes place inside a parenthesis () and is split by a dash.


Solution

  • library(tidyr)
    library(dplyr)
    library(stringr)
    
    df2 |>
      mutate(parties = str_remove_all(party, ".*\\(|\\).*"),
             parties = str_split(parties, fixed("-"))) |>
      unnest_wider(parties, names_sep = "_")
    #   district party                                    parties_1  parties_2   parties_3  
    #      <int> <chr>                                    <chr>      <chr>       <chr>      
    # 1     1000 PartyGreen                               PartyGreen NA          NA         
    # 2     1001 Coalition1(PartyRed-PartyBlue)           PartyRed   PartyBlue   NA         
    # 3     1002 PartyRed                                 PartyRed   NA          NA         
    # 4     1003 Coal.(PartyBlue-PartyOrange-VelvetParty) PartyBlue  PartyOrange VelvetParty
    

    If you wanted the column names to be party_1, party_2, etc. instead of parties_1, parties_2, etc. you could pipe this output to:

    rename_with(~ str_replace(., fixed("parties"), "party"), starts_with("parties"))