
Split on colons and commas but ignore commas inside brackets in R?


I have a data frame, and I want to split the strings in the Mem column by commas and colons. Here's my example:

df <- data.frame(ID=c("AM", "UA", "AS"),
                 Mem = c("WRR(World Happiness Report Index,WHRI)(Cs):1470,Country(%):60.2,UAM(The Star Spangled Banner,TSSB)(s):1380,City(%):69.7,TSSB/Cs(%):93.88,Note:pass",
                         "WRR(World Happiness Report Index,WHRI)(Cs):2280,Country(%):96.2,UAM(The Star Spangled Banner,TSSB)(s):2010,City(%):107.5,TSSB/Cs(%):88.16,Note:pass",
                         "WRR(World Happiness Report Index,WHRI)(Cs):3170,Country(%):101.6,UAM(The Star Spangled Banner,TSSB)(s):2950,City(%):95.5,TSSB/Cs(%):93.06,Note:pass"))

Commas inside the parentheses should not be treated as separators. The result should be:

    ID WRR(World Happiness Report Index,WHRI)(Cs) Country(%) UAM(The Star Spangled Banner,TSSB)(s) City(%) TSSB/Cs(%) Note
1:  AM                                       1470       60.2                                  1380    69.7      93.88 pass
2:  UA                                       2280       96.2                                  2010   107.5      88.16 pass
3:  AS                                       3170      101.6                                  2950    95.5      93.06 pass

Any help would be greatly appreciated!


Solution

  • The pattern ":[^,]+,*" separates the variable names from their values. It matches:

    • A colon (:);
    • Followed by one or more (+) characters that are not commas ([^,]);
    • Then zero or more (*) commas (,), in practice the single comma that ends the pair, if there is one.
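
To see what the pattern consumes, here is a minimal base-R sketch on a shortened, made-up string (x is an assumption for illustration, not a row from the question). It lists every match, so whatever lies between the matches is a variable name:

```r
# Hypothetical short input with the same shape as the Mem column
x <- "A(p,q)(s):10,B(%):2.5,Note:pass"

# Extract every substring matched by the pattern
matched <- regmatches(x, gregexpr(":[^,]+,*", x))[[1]]
print(matched)
# [1] ":10,"  ":2.5," ":pass"
```

The comma inside "A(p,q)(s)" is never touched because a match can only start at a colon.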

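    Splitting on this pattern drops each ":value," chunk and leaves only the names, which we can save in a variable:

    library(tidyverse)  # loads stringr (str_split_1 needs >= 1.5.0), dplyr and tidyr
    variables <- str_split_1(df$Mem[1], ":[^,]+,*") %>% head(-1)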
    

    Note: the split leaves an empty string after the last match, hence the head(-1).

    To get the values, we remove every element of variables from the string. I don't know of an existing function that does this for a vector of patterns, so here is a custom one. fixed() makes each name match literally, which matters because the names contain regex metacharacters such as parentheses:

    str_remove_multiple <- function(x, patterns){
      for(i in patterns) x <- str_remove_all(x, fixed(i))
      x
    }
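
As a quick check of the idea, here is a base-R sketch of the same loop using gsub(fixed = TRUE) instead of stringr. The name remove_fixed_all and the sample inputs are hypothetical and only for illustration:

```r
# Hypothetical base-R counterpart of the helper above:
# remove each pattern in turn, treating it as a literal string
remove_fixed_all <- function(x, patterns) {
  for (p in patterns) x <- gsub(p, "", x, fixed = TRUE)
  x
}

# Made-up input with the same shape as the Mem column
x <- "A(p,q)(s):10,B(%):2.5,Note:pass"
vars <- c("A(p,q)(s)", "B(%)", "Note")

# Drop the names and then the colons, leaving comma-separated values
cleaned <- remove_fixed_all(x, c(vars, ":"))
print(cleaned)
# [1] "10,2.5,pass"
```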
    

    After "cleaning" the Mem variable, we can split it by the remaining ",", and save each value to a new column based on variables:

    df %>%
      mutate(Mem = str_remove_multiple(Mem, c(variables, ":"))) %>%
      separate(Mem, into = variables, sep = ",") %>%
      mutate(across(-c(ID, Note), as.numeric))
    

    Result:

      ID WRR(World Happiness Report Index,WHRI)(Cs) Country(%) UAM(The Star Spangled Banner,TSSB)(s) City(%) TSSB/Cs(%) Note
    1 AM                                       1470       60.2                                  1380    69.7      93.88 pass
    2 UA                                       2280       96.2                                  2010   107.5      88.16 pass
    3 AS                                       3170      101.6                                  2950    95.5      93.06 pass
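
A different technique, not the one used above, is to split directly on commas that are not inside parentheses by using a negative lookahead. The following is only a sketch in base R on a shortened string. It assumes the parentheses are balanced and not nested, and that names contain no colons:

```r
# Shortened sample taken from the shape of the Mem column
x <- "WRR(World Happiness Report Index,WHRI)(Cs):1470,Country(%):60.2,Note:pass"

# Split on a comma unless the next bracket character after it is ")",
# i.e. unless the comma sits inside a pair of parentheses
pairs <- strsplit(x, ",(?![^()]*\\))", perl = TRUE)[[1]]

# The name is everything before the last colon, the value everything after it
keys <- sub(":[^:]*$", "", pairs)
values <- sub("^.*:", "", pairs)

result <- setNames(values, keys)
print(result)
```

Each element of result is still a character string, so numeric columns would need as.numeric() afterwards, as in the pipeline above.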