Search code examples
rdplyrtidyrread.fwf

R separate lines into columns specified by start and end


I'd like to split a dataset made of character strings into columns specified by start and end.

My dataset looks something like this:

>head(templines,3)
[1] "201801 1  78"
[2] "201801 2  67"
[3] "201801 1  13"

and i'd like to split it by specifying my columns using the data dictionary:

>dictionary
col_name col_start col_end  
year      1         4  
week      5         6  
gender    8         8  
age       11        12  

so it becomes:

year    week    gender    age
2018    01      1         78
2018    01      2         67
2018    01      1         13

In reality the data comes from a long running survey and the white spaces between some columns represent variables that are no longer collected. It has many variables so i need a solution that would scale.

In tidyr::separate it looks like you can only split by specifying the position to split at, rather than the start and end positions. Is there a way to use start / end?

I thought of doing this with read_fwf but I can't seem to be able to use it on my already loaded dataset. I only managed to get it to work by first exporting as a txt and then reading from this .txt:

write_lines(templines,"t1.txt")

read_fwf("t1.txt", 
     fwf_positions(start = dictionary$col_start,
                   end = dictionary$col_end,
                   col_names = dictionary$col_name)

is it possible to use read_fwf on an already loaded dataset?


Solution

  • Answering your question directly: yes, it is possible to use read_fwf with already loaded data. The relevant part of the docs is the part about the argument file:

    Either a path to a file, a connection, or literal data (either a single string or a raw vector).
    ...
    Literal data is most useful for examples and tests. 
    It must contain at least one new line to be recognised as data (instead of a path).
    

    Thus, you can simply collapse your data and then use read_fwf:

    templines %>% 
      paste(collapse = "\n") %>% 
      read_fwf(., fwf_positions(start = dictionary$col_start,
                                end = dictionary$col_end,
                                col_names = dictionary$col_name))
    

    This should scale to multiple columns, and is fast for many rows (on my machine for 1 million rows and four columns about half a second).

    There are a few warnings regarding parsing failures, but they stem from your dictionary. If you change the last line to age, 11, 12 it works as expected.