Search code examples
rimportfixed-widthvariable-length

Importing Variable length text files per individual into R


I would like to import a very large text file as a dataframe into R. The file is produced by the "National Institute for Health and Disability Insurance" in Belgium. It contains individual data on health professionals in Belgium.

It's a fixed width text file, and I usualy import that kind of data with "read_fwf". The problem is that the structure of the file is a bit tricky :

  • The individual data are divided into eight "types" ("type 11", "type 12", "type 13", etc.) that are on different rows. So each individual has several rows (= different type of information).
  • Each row begins with a "prefix part" of the same information about these individuals, which allows the data to be joined by individual. This "prefix part" is therefore repeated from row to row for the same individual.
  • Each part (each "type"/row and the "prefix part") has its own fixed width structure. For exemple, the prefix part has 54 characters, in which the type is indicated in the 53rd and 54th characters ; type 11 has 249 characters divided into 18 variables ; type 12 has 78 characters divided into 5 variables ; type 13 has 110 characters divided into 12 variables ; etc.
  • BUT, the number of "types" (and therefore the number of rows) per individual is variable. Some individuals have 3 types/row, others 4, 5, 6, 7 or 8. This means that there is a variable number of types/rows per individual.

Here's a fictive exemple of the structure, very simplified. The "prefixed part" is the 0000xxx part. The 5 first characters indicates the individual. The 2 last characters of the "prefixed part" specifies the "type" of the row. Each row contains data specific to that "type", divided into variables (the width structure of the variables is indicated in comment) :

0000111abaabacd
0000112abbbba
0000113ccaxyzzghj5
0000211acatbacz
0000212zbabba
0000311tyyyuacd
0000312xbbiop
0000411pkggbacz

#type 11 has 3 variables. Width for each : 2,3,3
#type 12 has 2 variables. Width for each : 3,3
#type 13 has 4 variables. Width for each : 3,3,1,4

So, what i'd like to do is to import into a dataframe in R each individual into one unique row, with all the variables of each type into separated columns. If the variables are missing (because there is no type/row), i'd like indicate empty data (NA) for each variable.

id     var1  var2  var3  var4  var5  var6  var7  var8  var9
00001    ab   aab   acd   abb   bba   cca   xyz     z  ghj5
00002    ac   atb   acz   zba   bba    NA    NA    NA    NA  
00003    ty   yyu   acd   xbb   iop    NA    NA    NA    NA
00004    pk   ggb   acz    NA    NA    NA    NA    NA    NA

I'm a bit lost, I don't know where to start, because the structure is really not usual for me. Do you have ideas or advices ?

Thank you very much !


Solution

  • Assuming the input shown reproducibly in the Note at the end read it as 3 fields and then reshape to wide form with one column per type. Finally separate the fields in each such column.

    library(dplyr)
    library(readr)
    library(tidyr)
    
    cols <- fwf_cols(id = 5, type = 2, misc = NA)
    
    # DF <- read_fwf("myfile", cols)
    DF <- read_fwf(Lines, cols)
    
    DF %>%
      pivot_wider(names_from = "type", values_from = "misc") %>%
      separate(`11`, c("var1", "var2", "var3"), cumsum(c(2, 3, 2))) %>%
      separate(`12`, c("var4", "var5"), cumsum(c(3, 3))) %>%
      separate(`13`, c("var6", "var7", "var8", "var9"), cumsum(c(3, 3, 1, 4)))
    

    giving:

    # A tibble: 4 x 10
      id    var1  var2  var3  var4  var5  var6  var7  var8  var9 
      <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
    1 00001 ab    aab   ac    abb   bba   cca   xyz   z     ghj5 
    2 00002 ac    atb   ac    zba   bba   <NA>  <NA>  <NA>  <NA> 
    3 00003 ty    yyu   ac    xbb   iop   <NA>  <NA>  <NA>  <NA> 
    4 00004 pk    ggb   ac    <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
    

    Note

    Lines <- "0000111abaabacd
    0000112abbbba
    0000113ccaxyzzghj5
    0000211acatbacz
    0000212zbabba
    0000311tyyyuacd
    0000312xbbiop
    0000411pkggbacz"