Search code examples
rstrsplitnormalize

R string split, to normalized (long) format with running index


I have this data frame

structure(list(rule.id = c(1, 2), rules = structure(1:2, .Label = c("Lamp1.1,Lamp1.2", 
"Lamp2.1,Lamp2.2"), class = "factor")), .Names = c("rule.id", 
"rules"), row.names = c(NA, -2L), class = "data.frame")

#  rule.id           rules
#1       1 Lamp1.1,Lamp1.2
#2       2 Lamp2.1,Lamp2.2

which I need to split on the "rules" column by separator comma (","), multiple commas occur (not only 2 like in example) and then transform this into a normalized format with keeping the relevant rule.id value from the original df. The result should look like this:

structure(list(rule.id = c(1, 1, 2, 2), lhs = c("Lamp1.1", "Lamp1.2", 
"Lamp2.1", "Lamp2.1")), .Names = c("rule.id", "lhs"), row.names = c(NA, 
-4L), class = "data.frame")

#  rule.id     lhs
#1       1 Lamp1.1
#2       1 Lamp1.2
#3       2 Lamp2.1
#4       2 Lamp2.1

I have a code that takes care of the str split and normalize (long) format, but not sure how to take care of the rule.id requirement

lhs.norm <- as.data.frame(
  cbind(
    rules.df$ruleid, 
    unlist(strsplit(
      unlist(lapply(strsplit(unlist(lapply(as.character(rules.df$rules),function(x) substr(x,2,nchar(x)))), "} =>", fixed = T), function(x) x[1]))
      ,","))))

thanks to @acrun solution using

cSplit(rules.df.lhs, "lhs", ",", "long"))

I benchmarked 19 seconds for 1M rows (result was around 2M rows)


Solution

  • We can use cSplit from splitstackshape

    library(splitstackshape)
    cSplit(df, "rules", ",", "long")
    #   rule.id   rules
    #1:       1 Lamp1.1
    #2:       1 Lamp1.2
    #3:       2 Lamp2.1
    #4:       2 Lamp2.2
    

    If this is a huge dataset, we can use stringi to split

    library(stringi)
    lst <- stri_split_fixed(df$rules, ",")
    df2 <- data.frame(rule.id = rep(df$rule.id, lengths(lst)),
                      rules = unlist(lst))
    df2
    #   rule.id   rules
    #1       1 Lamp1.1
    #2       1 Lamp1.2
    #3       2 Lamp2.1
    #4       2 Lamp2.2
    

    Another option is data.table

    library(data.table)
    setDT(df)[, strsplit(as.character(rules), ","), by = rule.id]