I am looking for a dplyr
way to break variable into multiple columns according to dictionary:
vardic <- data.frame(varname=c('a','b','c','d'),
length=c(2,6,3,1) ) %>%
mutate(end=cumsum(length),start=end-length+1)
d <- data.frame(orig_string=c('11333333444A',
'22444444111C',
'55666666000B'))
The desired output is:
d2 <- data.frame(a=c(11,22,55),b=c(333333,444444,666666),c=c(444,111,000),d=c('A','C','B')
This has to be done using only dplyr commands because this will be implemented via arrow on a larger than memory dataset (asked in this other question)
UPDATE (responding to comments): functions outside dplyr
could be used, as long as supported by arrow. arrow's list of R/dplyr supported functions describes what has been implemented so far. Hopefully this pseudocode illustrates the pipeline:
library(tidyverse)
library(arrow)
d %>% write_dataset('myfile',format='parquet')
'myfile' %>% open_dataset %>%
sequence_of_arrowsupported_commands_to_split_columns
Update2: added cols indicating start
and end
position in vardic
Update3: made the arrow pipeline, above, more reproducible. then tested @akrun's solution. But separate
is not supported by arrow
OP here. Tks for all the support. All other answers are great and work well with purr + dplyr. However, these loop/map through the variables have not been implemented in arrow
yet. One solution, however, is to have the loop outside and the arrow command be repeated for eah variable:
For instance (a hard coded sequence would be):
ds <- 'file' %>% open_dataset
ds <- ds %>% mutate(a=str_sub(orig_string,1,2))
ds <- ds %>% mutate(b=str_sub(orig_string,3,8))
...
ds %>% collect
Now reimplement this as a function + a loop:
extract_var_arrow <- function(ds,var){
s <- vardic[varname==var]$start
e <- vardic[varname==var]$end
ds %>% mutate("{var}" := str_sub(orig_string,s,e)) %>% return
}
for(v in vardic$varname){
ds <- ds %>% extract_var_arrow(v)
}
Note that, until it sees a collect
statement arrow, is just compiling a query. So the above loop is equivalent to:
# ds <- ds %>% extract_var_arrow('a')
# %>% extract_var_arrow('b')
# %>% extract_var_arrow('c')
# %>% extract_var_arrow('d')
Finally we can collect
ds %>% select(-orig_string) %>% collect
a b c d
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B