library(tidyverse)
I have this stupid csv file where someone forgot to quote the string in the last column which may contain newlines. (The 2nd row in this example doesn’t)
csv_file <- str_c(
"a,b,c\n",
"1,1,first\nrow\n",
"1,1,second row\n",
"1,1,third\nrow\n",
collapse = ""
)
Trying to read the file with read_csv()
gives rise to problems.
read_csv(csv_file)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> Rows: 5 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): a, c
#> dbl (1): b
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 5 × 3
#> a b c
#> <chr> <dbl> <chr>
#> 1 1 1 first
#> 2 row NA <NA>
#> 3 1 1 second row
#> 4 1 1 third
#> 5 row NA <NA>
How can one parse the file to get the expected results?
tibble(
a = c(1, 1, 1),
b = c(1, 1, 1),
c = c("first\nrow", "second row", "third\nrow")
)
#> # A tibble: 3 × 3
#> a b c
#> <dbl> <dbl> <chr>
#> 1 1 1 "first\nrow"
#> 2 1 1 "second row"
#> 3 1 1 "third\nrow"
Created on 2023-01-06 with reprex v2.0.2
As long as the first column doesn't contain commas, one method would be to replace all newlines with a tab character, and then use gsub
to replace any tab that isn't followed by another tab before the next comma. Once you have read in the data, you can easily gsub
the tabs in the final column back into newlines:
read.csv(text = gsub('\t([[:alnum:]]+),', '\n\\1,',
gsub('\n', '\t', csv_file))) %>%
as_tibble() %>%
mutate(c = gsub('\t', '\n', c))
#> # A tibble: 3 x 3
#> a b c
#> <int> <int> <chr>
#> 1 1 1 "first\nrow"
#> 2 1 1 "second row"
#> 3 1 1 "third\nrow\n"
Once you have read the data in you can easily gsub
the tabs in the final column into newlines
Or alternatively with stringr
csv_file |>
str_replace_all(
regex("\n(\\w+)(?=\n)", multiline = TRUE), "|\\1"
) |>
read_csv() |>
mutate(
c = str_replace(c, "[|]", "\n")
)