I have a data set that looks like this:
data = tibble(emp = c(1:4),
idstring = c("PER20384|PER49576|PER10837|PER92641",
"PER20384|PER49576|PER03875|PER72534",
"PER20384|PER98642|PER17134",
"PER20384|PER98623|PER17134|PER01836|PER1234"))
I would like to separate idstring by "|" into individual columns. However, I need the rightmost characters (e.g., "PER92641") to always be in a column labeled "Level_1" and the leftmost characters to vary based on the number of characters in the row.
I have tried some basic steps like:
data_split = str_split(data$idstring, "\\|", simplify = T)
colnames(data_split) = paste0("Level_", ncol(data_split):1)
But I get incorrect output like this this:
Level_5 Level_4 Level_3 Level_2 Level_1
[1,] "PER20384" "PER49576" "PER10837" "PER92641" ""
[2,] "PER20384" "PER49576" "PER03875" "PER72534" ""
[3,] "PER20384" "PER98642" "PER17134" "" ""
[4,] "PER20384" "PER98623" "PER17134" "PER01836" "PER1234"
Where it should look like this:
Level_5 Level_4 Level_3 Level_2 Level_1
[1,] NA "PER20384" "PER49576" "PER10837" "PER92641"
[2,] NA "PER20384" "PER49576" "PER03875" "PER72534"
[3,] NA NA "PER20384" "PER98642" "PER17134"
[4,] "PER20384" "PER98623" "PER17134" "PER01836" "PER1234"
Notice that I would also ideally like to have NAs in place of empty spaces where applicable.
I get the sense that I could somehow reverse the order of each row, then replace spaces with NAs, before adding the colnames but I'm hoping there is a more elegant solution to be found here.
This can be done by doing an order
on the NA values. We split the 'idstring' at |
into a list
, get the max
lengths
of the list
elements ('mx'). Use that to pad NA
with length<-
(by default it pad at the end instead of the beginning), then we order
the vector based on the NA elements, rbind
the list
elements
lst1 <- strsplit(data$idstring, "|", fixed = TRUE)
mx <- max(lengths(lst1))
out <- do.call(rbind, lapply(lst1, function(x) {
length(x) <- mx
x[order(!is.na(x))]
}))
colnames(out) <- paste0("Level_", ncol(out):1)
-output
# Level_5 Level_4 Level_3 Level_2 Level_1
#[1,] NA "PER20384" "PER49576" "PER10837" "PER92641"
#[2,] NA "PER20384" "PER49576" "PER03875" "PER72534"
#[3,] NA NA "PER20384" "PER98642" "PER17134"
#[4,] "PER20384" "PER98623" "PER17134" "PER01836" "PER1234"
Or another option is to use read.table
to read the column and and then modify the row values by rearranging the NA elements at the front
d1 <- read.table(text = data$idstring, sep="|", header = FALSE,
fill = TRUE, na.strings = c(""), col.names = paste0('Level_', 5:1))
d1[1] <- t(apply(d1, 1, function(x) c(x[is.na(x)], x[!is.na(x)])))