I am trying to read in a .DAT file that is supposed to be tab delimited, however it is a bit messier than anticipated. The data should have five columns:
c("12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT",
"UNKNOWN",
"CONTRAINDICATION, STOP",
"75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal",
"75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise OKAY. ETT okay. ",
"CT neg. Left s w/1mm cyst.",
"75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
)
I pulled a piece of the file from readLines
.
Ideally, the output would be:
V1 | V2 | V3 | V4 | V5 |
---|---|---|---|---|
12315 | 01/01/1999 | 22:31 | Other, specify - Test | BILE TUCT, UNKNOWN, CONTRAINDICATION, STOP |
75 | 01/28/2021 | 19:34 | ct | Bilateral (unable to cannulate), normal |
75 | 07/01/2014 | 15:01:02 | CT/MRI | CT chest shows collapse...otherwise OKAY. ETT okay. CT neg. Left s w/1mm cyst. |
75 | 06/13/2018 | 14:30 | Cardiac cath | Normal A, EF 66%, no AS/MR |
Assuming that tab is not used in free text fields and dat
being the result of readLines()
/ readr::read_lines()
, we could
grepl("\t", lines) |> cumsum()
for grouping, to collect consecutive related lines (one with tabs optionally followed by those without tabs)", "
;readr::read_tsv()
library(dplyr, warn.conflicts = FALSE)
dat |>
tibble(lines = _) |>
group_by(record = grepl("\t", lines) |> cumsum()) |>
#> lines record
#> <chr> <int>
#> 1 "12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT" 1
#> 2 "UNKNOWN" 1
#> 3 "CONTRAINDICATION, STOP" 1
#> 4 "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal" 2
#> 5 "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise… 3
#> 6 "CT neg. Left s w/1mm cyst." 3
#> 7 "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR" 4
summarise(lines = paste0(lines, collapse = ", ")) |>
#> record lines
#> <int> <chr>
#> 1 1 "12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT, UNKNOWN, …
#> 2 2 "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal"
#> 3 3 "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise…
#> 4 4 "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
pull(lines) |>
I() |>
readr::read_tsv(col_names = FALSE)
#> # A tibble: 4 × 5
#> X1 X2 X3 X4 X5
#> <dbl> <chr> <time> <chr> <chr>
#> 1 12315 01/01/1999 22:31:00 Other, specify - Test BILE TUCT, UNKNOWN, CONTRAIND…
#> 2 75 01/28/2021 19:34:00 ct Bilateral (unable to cannulat…
#> 3 75 07/01/2014 15:01:02 CT/MRI CT chest shows collapse...oth…
#> 4 75 06/13/2018 14:30:00 Cardiac cath Normal A, EF 66%, no AS/MR
Example data:
dat <- c("12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT",
"UNKNOWN",
"CONTRAINDICATION, STOP",
"75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal",
"75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise OKAY. ETT okay. ",
"CT neg. Left s w/1mm cyst.",
"75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
)