Split strings in multiple columns into separate rows using only base R


I tried to adapt the base R answer from this post: Splitting a string into new rows in R

split_dfcheck <- as.data.frame(do.call(rbind, apply(test, 1, function(x) {
  do.call(expand.grid, strsplit(x, " *; *"))
})))

However, when I use it on my example data:

test <- data.frame(
  name = c("Andrea;barbara;erica;Matteo", "Matteo;sonia;marta", "Matteo;sonia", "sonia;marta;matteo"),
  code = c("A", "A", "B", "B"),
  number = c(1, 1, 2, 2),
  surname = c("surname1;surname2;surname3;surname4", "surname1;surname2;surname3", "surname1;surname2", "surname1;surname2;surname3")
)

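I get 38 rows, because expand.grid builds every name/surname combination within each original row (16 + 9 + 4 + 9), whereas with separate_rows I get 12 rows: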

library(tidyr)  # provides separate_rows() and re-exports %>%
test1 <- test %>%
  separate_rows(c(name, surname), sep = ";")

Is there a way to get the same result as with separate_rows but using base R?

Thanks!


Solution

# apply() passes each row as a named character vector; every field is split on ";"
# and as.data.frame() recycles the single-valued fields (code, number) to match.
test2 <- do.call(
  rbind,
  apply(test, 1, \(x) as.data.frame(sapply(x, strsplit, ";")))
)
test2
#       name code number  surname
# 1   Andrea    A      1 surname1
# 2  barbara    A      1 surname2
# 3    erica    A      1 surname3
# 4   Matteo    A      1 surname4
# 5   Matteo    A      1 surname1
# 6    sonia    A      1 surname2
# 7    marta    A      1 surname3
# 8   Matteo    B      2 surname1
# 9    sonia    B      2 surname2
# 10   sonia    B      2 surname1
# 11   marta    B      2 surname2
# 12  matteo    B      2 surname3

This has the drawback of converting all columns to character, though, because apply() first coerces the data frame to a character matrix. If you only have character and numeric columns, you can restore the numeric ones with:

numeric_columns <- Filter(\(x) is.numeric(test[[x]]), names(test))
test2[numeric_columns] <- lapply(test2[numeric_columns], as.numeric)
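If you would rather avoid the character conversion altogether, a variant that never goes through apply() might look like the sketch below. It is not part of the answer above, just one possible approach: it assumes that the columns to split are known up front and that, within each row, they contain the same number of ";"-separated pieces. The names split_cols, pieces, n and test3 are made up for illustration.

```r
# Example data from the question.
test <- data.frame(
  name = c("Andrea;barbara;erica;Matteo", "Matteo;sonia;marta",
           "Matteo;sonia", "sonia;marta;matteo"),
  code = c("A", "A", "B", "B"),
  number = c(1, 1, 2, 2),
  surname = c("surname1;surname2;surname3;surname4",
              "surname1;surname2;surname3",
              "surname1;surname2", "surname1;surname2;surname3")
)

# Hypothetical helper variable: the columns that hold ";"-separated values.
split_cols <- c("name", "surname")

# Split each of those columns into a list with one character vector per row.
pieces <- lapply(test[split_cols], function(col) strsplit(as.character(col), " *; *"))

# Number of pieces per row, taken from the first split column.
n <- lengths(pieces[[1]])

# Assumption check: every split column has the same piece count in each row.
stopifnot(all(vapply(pieces, function(p) identical(lengths(p), n), logical(1))))

# Repeat each original row n[i] times (this keeps the column types),
# then overwrite the split columns with the flattened pieces.
test3 <- test[rep(seq_len(nrow(test)), times = n), ]
test3[split_cols] <- lapply(pieces, unlist)
rownames(test3) <- NULL

str(test3)
```

Because the rows are repeated by indexing rather than rebuilt from a character matrix, number stays numeric and no conversion step is needed afterwards. If a row had, say, three names but only two surnames, the stopifnot() check would fail instead of silently recycling values.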