Search code examples
rdata.tablelapplyfixed-width

extract several variables from string position using a dictionary (varname,startposition,endposition) in data.table


I have a data.table, DT containing one string variable, s, from which I want to create additional variables based on the positions of each character in s. The varname, start and end positions and vartype of each variable are given in a dictionary, dic. Here is an example, including the desired output:

DT <- DT <- data.table(s=c('a191','b292','c393'))
dic <- data.table(varname=c('bla,ble,bli'),start=c(1,2,3),end=c(1,2,4),vartype=c('c','i','i')
DTdesired <- data.table(bla=c('a','b','c'),ble=c(1,2,3),bli=c(91,92,93))

This is the same problem as importing data from a fixed width file (fwf), which has been covered in this SO question. The answer to that post includes a data table solution

 DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
      stri_sub(V1, cols$beg[ii], cols$end[ii])
    })]

But I am looking for a solution that besides start and end would also use the varname and coltype variables of the dictionary

Edit: In reality, DT would be a large Census file, around 200 million observations. Each element of s is a string with 200 characters. And the dictionary, dic, can contain about 10 to 50 variables to be extracted, depending on the year.


Solution

  • We can use Map to do the conversion. (added the updated function created by @LucasMation)

    library(magrittr)
    library(data.table)
    subtr_typeconvert <- function(x, y)
         {
     DT$s %>%
       substr(x, y) %>%
         type.convert(as.is = TRUE)
    }
    dic[, setNames(Map(function(x, y)
          subtr_typeconvert(x,y), start, end), varname)]