Search code examples
rdplyrapache-arrow

`dplyr` way to break variable into multiple columns acording to layout-dictionary (with varname and length)


I am looking for a dplyr way to break variable into multiple columns according to dictionary:

vardic <- data.frame(varname=c('a','b','c','d'),
                     length=c(2,6,3,1) ) %>% 
               mutate(end=cumsum(length),start=end-length+1)

d <- data.frame(orig_string=c('11333333444A',
                              '22444444111C',
                              '55666666000B'))

The desired output is:

d2 <- data.frame(a=c(11,22,55),b=c(333333,444444,666666),c=c(444,111,000),d=c('A','C','B')

This has to be done using only dplyr commands because this will be implemented via arrow on a larger than memory dataset (asked in this other question)

UPDATE (responding to comments): functions outside dplyr could be used, as long as supported by arrow. arrow's list of R/dplyr supported functions describes what has been implemented so far. Hopefully this pseudocode illustrates the pipeline:

library(tidyverse)
library(arrow)
d %>% write_dataset('myfile',format='parquet')
'myfile' %>% open_dataset %>% 
    sequence_of_arrowsupported_commands_to_split_columns

Update2: added cols indicating start and end position in vardic

Update3: made the arrow pipeline, above, more reproducible. then tested @akrun's solution. But separate is not supported by arrow


Solution

  • OP here. Tks for all the support. All other answers are great and work well with purr + dplyr. However, these loop/map through the variables have not been implemented in arrow yet. One solution, however, is to have the loop outside and the arrow command be repeated for eah variable:

    For instance (a hard coded sequence would be):

    ds <- 'file' %>% open_dataset 
    ds <- ds %>% mutate(a=str_sub(orig_string,1,2))
    ds <- ds %>% mutate(b=str_sub(orig_string,3,8))
    ...
    ds %>% collect
    

    Now reimplement this as a function + a loop:

    extract_var_arrow <- function(ds,var){
      s <- vardic[varname==var]$start
      e <- vardic[varname==var]$end
      ds %>% mutate("{var}" := str_sub(orig_string,s,e)) %>% return
    }
    for(v in vardic$varname){
      ds <- ds %>% extract_var_arrow(v)
    }
    

    Note that, until it sees a collect statement arrow, is just compiling a query. So the above loop is equivalent to:

       # ds <- ds %>% extract_var_arrow('a') 
       #             %>% extract_var_arrow('b')
       #                %>% extract_var_arrow('c')
       #                   %>% extract_var_arrow('d')
    

    Finally we can collect

    ds %>% select(-orig_string) %>% collect
    
       a      b   c d
    1 11 333333 444 A
    2 22 444444 111 C
    3 55 666666 000 B