Search code examples
rstringsplitmultiple-columns

Using R to split a structured string into multiple columns and set column names based on string values


I have a column of character string with each row containing information for multiple columns, including the column names. Every cell is structured so one can easily see the different pieces of information, but it is not easy to work with in this format. A cell is formatted like 'variable: value', with multiple inputs separated by commas. I would like to split the column into separate columns with the column names based on what's written before the colon. Here's an example.

my_df <- tibble(
address = c("street: AvenueName, number: 1, code: 1234 AB, city: City One", 
"street: AnotherStreet, number: 99, code: 5678 CD, city: Town Two")
)

I know I can split strings by using separate_wider_delim and then assign the column names to the data.frame. While this works, I would like to see a method that does not depend on setting the column names 'manually', but derive them from the values inside the string.

My expected dataframe would look like this:

expected_df <- tibble(
street= c("AvenueName", "AnotherStreet"),
number = c(1, 99),
code = c("1234 AB", "5678 CD"),
city = c("City One", "Town Two")
)

Solution

  • Without hard-coding column names, here is one approach -

    library(dplyr)
    library(tidyr)
    
    my_df %>%
      #To keep track of each row of data 
      mutate(row = row_number()) %>%
      # Bring data in separate rows splitting on comma
      separate_rows(address, sep = ",\\s*") %>%
      # Split data on colon to get data in two columns
      separate_wider_delim(address, ": ", names = c("col", "value")) %>%
      # Get data in wide format
      pivot_wider(names_from = col, values_from = value) %>%
      # Drop row column
      select(-row)
    
    #  street        number code    city    
    #  <chr>         <chr>  <chr>   <chr>   
    #1 AvenueName    1      1234 AB City One
    #2 AnotherStreet 99     5678 CD Town Two
    

    data

    I have taken the liberty to fix your data to match it with the expected output.

    my_df <- tibble(
      address = c("street: AvenueName, number: 1, code: 1234 AB, city: City One", 
                "street: AnotherStreet, number: 99, code: 5678 CD, city: Town Two")
    )