I have a dataframe with approx. 26,000 entries. The dataframe has one ID column with IDs separated by ";" and multiple value columns also separated by ";". It looks something like this:
df <- data.frame (ID = c("sample1;sample2;sample3", "sample3", "sample3;sample4;sample5;sample6"),
value_1 = c("10;20;30", "30", "30;40;50;60"),
value_2 = c("130", "130", "130"))
The problem is that the number and the position of the ID entries and the corresponding values are constant within a row, but they are different across the rows. In addition, there are columns (e.g. value_2) in which entries should remain constant. Is there an easy way to reduce the complexity of the data set? Thanks
We can use separate_rows
library(dplyr)
library(tidyr)
df %>%
separate_rows(ID, value_1, convert = TRUE)
-output
# A tibble: 8 x 3
# ID value_1 value_2
# <chr> <int> <chr>
#1 sample1 10 130
#2 sample2 20 130
#3 sample3 30 130
#4 sample3 30 130
#5 sample3 30 130
#6 sample4 40 130
#7 sample5 50 130
#8 sample6 60 130
Or using cSplit
library(splitstackshape)
cSplit(df, c("ID", "value_1"), ";", "long")