I would like to import a very large text file as a dataframe into R. The file is produced by the "National Institute for Health and Disability Insurance" in Belgium. It contains individual data on health professionals in Belgium.
It's a fixed width text file, and I usualy import that kind of data with "read_fwf". The problem is that the structure of the file is a bit tricky :
Here's a fictive exemple of the structure, very simplified. The "prefixed part" is the 0000xxx part. The 5 first characters indicates the individual. The 2 last characters of the "prefixed part" specifies the "type" of the row. Each row contains data specific to that "type", divided into variables (the width structure of the variables is indicated in comment) :
0000111abaabacd
0000112abbbba
0000113ccaxyzzghj5
0000211acatbacz
0000212zbabba
0000311tyyyuacd
0000312xbbiop
0000411pkggbacz
#type 11 has 3 variables. Width for each : 2,3,3
#type 12 has 2 variables. Width for each : 3,3
#type 13 has 4 variables. Width for each : 3,3,1,4
So, what i'd like to do is to import into a dataframe in R each individual into one unique row, with all the variables of each type into separated columns. If the variables are missing (because there is no type/row), i'd like indicate empty data (NA) for each variable.
id var1 var2 var3 var4 var5 var6 var7 var8 var9
00001 ab aab acd abb bba cca xyz z ghj5
00002 ac atb acz zba bba NA NA NA NA
00003 ty yyu acd xbb iop NA NA NA NA
00004 pk ggb acz NA NA NA NA NA NA
I'm a bit lost, I don't know where to start, because the structure is really not usual for me. Do you have ideas or advices ?
Thank you very much !
Assuming the input shown reproducibly in the Note at the end read it as 3 fields and then reshape to wide form with one column per type. Finally separate the fields in each such column.
library(dplyr)
library(readr)
library(tidyr)
cols <- fwf_cols(id = 5, type = 2, misc = NA)
# DF <- read_fwf("myfile", cols)
DF <- read_fwf(Lines, cols)
DF %>%
pivot_wider(names_from = "type", values_from = "misc") %>%
separate(`11`, c("var1", "var2", "var3"), cumsum(c(2, 3, 2))) %>%
separate(`12`, c("var4", "var5"), cumsum(c(3, 3))) %>%
separate(`13`, c("var6", "var7", "var8", "var9"), cumsum(c(3, 3, 1, 4)))
giving:
# A tibble: 4 x 10
id var1 var2 var3 var4 var5 var6 var7 var8 var9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 00001 ab aab ac abb bba cca xyz z ghj5
2 00002 ac atb ac zba bba <NA> <NA> <NA> <NA>
3 00003 ty yyu ac xbb iop <NA> <NA> <NA> <NA>
4 00004 pk ggb ac <NA> <NA> <NA> <NA> <NA> <NA>
Lines <- "0000111abaabacd
0000112abbbba
0000113ccaxyzzghj5
0000211acatbacz
0000212zbabba
0000311tyyyuacd
0000312xbbiop
0000411pkggbacz"