I am in the middle of parsing a large amount of CSV data. The data is rather "dirty": it has inconsistent delimiters, spurious characters, and formatting issues that cause problems for read_csv().
My problem here, however, is not the dirtiness of the data, but just trying to understand the parsing errors that read_csv() is giving me. If I can better understand the error messages, I can then do some janitorial work to fix the problem with scripts. The size of the data makes a manual approach intractable.
Here's a minimal example. Suppose I have a csv file like this:
"col_a","col_b","col_c"
"1","a quick","10"
"2","a quick "brown" fox","20"
"3","quick, brown fox","30"
Note that there are spurious quotes around "brown" in the second data row. This content goes into a file called "my_data.csv".
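For anyone who wants to reproduce this, something like the following should recreate the file. It is only a sketch in base R, and the variable name csv_lines is just for illustration.

```r
# Recreate the sample file. Single-quoted R strings keep the embedded
# double quotes literal, including the spurious ones around "brown".
csv_lines <- c(
  '"col_a","col_b","col_c"',
  '"1","a quick","10"',
  '"2","a quick "brown" fox","20"',
  '"3","quick, brown fox","30"'
)
writeLines(csv_lines, "my_data.csv")
```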
When I try to read that file, I get some parsing failures.
> library(tidyverse)
> df <- read_csv("./my_data.csv", col_types = cols(.default = "c"))
Warning: 2 parsing failures.
row # A tibble: 2 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 2 col_b delimiter or quote b './my_data.csv' file 2 2 col_b delimiter or quote './my_data.csv'
As you can see, the parsing failure has not been "pretty printed". It is ONE LONG LINE of 271 characters.
I can't figure out where to even put linebreaks in the failure message to see where the problem is and what the message is trying to tell me. Moreover, it refers to a "2x5 tibble". What tibble? My data frame is 3x3.
Can someone show me how to format or put linebreaks in the message from read_csv() so I can see how it is detecting the problem?
Yes, I know what the problem is in this particular minimal example. In my actual data I am dealing with large amounts of CSV (~1M rows), peppered with inconsistencies that shower me with hundreds of parsing failures. I'd like to set up a workflow for categorizing these and dealing with them programmatically. The first step, I think, is just understanding how to "parse" the parsing failure message.
After taking a breath and looking at the actual documentation, I see there is a way to get the parsing failures from read_csv() in a form that is very usable.
All I had to do to get the parsing failures was to use problems().
> library(tidyverse)
> df <- read_csv("./my_data.csv", col_types = cols(.default = "c"))
Warning: 2 parsing failures.
row # A tibble: 2 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 2 col_b delimiter or quote b './my_data.csv' file 2 2 col_b delimiter or quote './my_data.csv'
> parsing_failures <- problems(df)
> parsing_failures
# A tibble: 2 x 5
    row col   expected           actual file
  <int> <chr> <chr>              <chr>  <chr>
1     2 col_b delimiter or quote b      './my_data.csv'
2     2 col_b delimiter or quote        './my_data.csv'
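Apparently read_csv() attaches a tibble of parsing failure details to the data frame it returns, and that tibble is accessible by passing the result of read_csv() to problems(). This also explains the "2 x 5 tibble" in the warning: it is the failures tibble (2 failures, 5 columns of detail), not my 3x3 data.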
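Since the failures come back as an ordinary tibble, categorizing them is just regular data frame work. Here is a rough sketch of how I might start. The helper names summarise_failures and failing_rows are my own and not part of readr, and the code assumes the failures tibble has the row, col and expected columns shown above.

```r
library(dplyr)

# Hypothetical helper: count failures per column and failure type,
# most frequent first, so the dominant kinds of dirt show up at the top.
summarise_failures <- function(parsing_failures) {
  parsing_failures %>%
    count(col, expected, sort = TRUE, name = "n_failures")
}

# Hypothetical helper: the distinct row numbers that had any failure,
# as reported in the `row` column of the failures tibble.
failing_rows <- function(parsing_failures) {
  sort(unique(parsing_failures$row))
}

# Usage, with df read as above:
# summarise_failures(problems(df))
# failing_rows(problems(df))
```

With hundreds of failures, the summary should give a short list of categories to script fixes for, and the row numbers tell me which records to inspect in the raw file.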