Search code examples
rdplyrstrsplit

Separate string into new columns of unknown number


I have a data set that looks like this:

data = tibble(emp = c(1:4), 
                  idstring = c("PER20384|PER49576|PER10837|PER92641",
                          "PER20384|PER49576|PER03875|PER72534", 
                          "PER20384|PER98642|PER17134", 
                          "PER20384|PER98623|PER17134|PER01836|PER1234"))

I would like to separate idstring by "|" into individual columns. However, I need the rightmost characters (e.g., "PER92641") to always be in a column labeled "Level_1" and the leftmost characters to vary based on the number of characters in the row.

I have tried some basic steps like:

data_split = str_split(data$idstring, "\\|", simplify = T)
colnames(data_split) = paste0("Level_", ncol(data_split):1)

But I get incorrect output like this this:

  Level_5    Level_4    Level_3    Level_2    Level_1  
[1,] "PER20384" "PER49576" "PER10837" "PER92641" ""       
[2,] "PER20384" "PER49576" "PER03875" "PER72534" ""       
[3,] "PER20384" "PER98642" "PER17134" ""         ""       
[4,] "PER20384" "PER98623" "PER17134" "PER01836" "PER1234" 

Where it should look like this:

   Level_5      Level_4    Level_3    Level_2    Level_1  
[1,]   NA       "PER20384" "PER49576" "PER10837" "PER92641"        
[2,]   NA       "PER20384" "PER49576" "PER03875" "PER72534"        
[3,]   NA         NA       "PER20384" "PER98642" "PER17134"       
[4,] "PER20384" "PER98623" "PER17134" "PER01836" "PER1234"

Notice that I would also ideally like to have NAs in place of empty spaces where applicable.

I get the sense that I could somehow reverse the order of each row, then replace spaces with NAs, before adding the colnames but I'm hoping there is a more elegant solution to be found here.


Solution

  • This can be done by doing an order on the NA values. We split the 'idstring' at | into a list, get the max lengths of the list elements ('mx'). Use that to pad NA with length<- (by default it pad at the end instead of the beginning), then we order the vector based on the NA elements, rbind the list elements

    lst1 <- strsplit(data$idstring, "|", fixed = TRUE)
    mx <- max(lengths(lst1))
    out <- do.call(rbind,  lapply(lst1, function(x) {
           length(x) <- mx
            x[order(!is.na(x))]
      }))
    colnames(out) <- paste0("Level_", ncol(out):1)
    

    -output

    #    Level_5    Level_4    Level_3    Level_2    Level_1   
    #[1,] NA         "PER20384" "PER49576" "PER10837" "PER92641"
    #[2,] NA         "PER20384" "PER49576" "PER03875" "PER72534"
    #[3,] NA         NA         "PER20384" "PER98642" "PER17134"
    #[4,] "PER20384" "PER98623" "PER17134" "PER01836" "PER1234" 
    

    Or another option is to use read.table to read the column and and then modify the row values by rearranging the NA elements at the front

    d1 <- read.table(text = data$idstring, sep="|", header = FALSE, 
        fill = TRUE, na.strings = c(""), col.names = paste0('Level_',  5:1))
    
    d1[1] <- t(apply(d1, 1, function(x) c(x[is.na(x)], x[!is.na(x)])))