I need to breakdown the string values contained in a column. Some cells do not need any split at all. Others might required one, two, or more.
I would also like the split values to be stored in concurrent columns.
For instance, if my initial dataframe looks like this
df2 <- data.frame(district= 1000:1003,
party= c("PartyGreen", "Coalition1(PartyRed-PartyBlue)", "PartyRed", "Coal.(PartyBlue-PartyOrange-VelvetParty)"))
I would like to look like this:
df.neat.2 <- data.frame(district= 1000:1003,
party= c("PartyGreen", "Coalition1(PartyRed-PartyBlue)", "PartyRed", "Coal.(PartyBlue-PartyOrange-VelvetParty)"),
party1= c("PartyGreen", "PartyRed", "PartyRed", "PartyBlue"),
party2= c(NA, "PartyBlue", NA, "PartyOrange"),
party3= c(NA, NA, NA, "VelvetParty"))
Note how some cells contain a single value that does not need to be split. Also note that the string split takes place inside a parenthesis ()
and is split by a dash.
library(tidyr)
library(dplyr)
library(stringr)
df2 |>
mutate(parties = str_remove_all(party, ".*\\(|\\).*"),
parties = str_split(parties, fixed("-"))) |>
unnest_wider(parties, names_sep = "_")
# district party parties_1 parties_2 parties_3
# <int> <chr> <chr> <chr> <chr>
# 1 1000 PartyGreen PartyGreen NA NA
# 2 1001 Coalition1(PartyRed-PartyBlue) PartyRed PartyBlue NA
# 3 1002 PartyRed PartyRed NA NA
# 4 1003 Coal.(PartyBlue-PartyOrange-VelvetParty) PartyBlue PartyOrange VelvetParty
If you wanted the column names to be party_1
, party_2
, etc. instead of parties_1
, parties_2
, etc. you could pipe this output to:
rename_with(~ str_replace(., fixed("parties"), "party"), starts_with("parties"))