Search code examples
rcsvparsingtidyversereadr

read_csv() parsing error message, how to interpret?


I am in the middle of parsing in a large amount of csv data. The data is rather "dirty" in that I have inconsistent delimiters, spurious characters and format issues that cause problems for read_csv().

My problem here, however, is not the dirtiness of the data, but just trying to understand the parsing errors that read_csv() is giving me. If I can better understand the error messages, I can then do some janitorial work to fix the problem with scripts. The size of the data makes a manual approach intractable.

Here's a minimal example. Suppose I have a csv file like this:

"col_a","col_b","col_c"
"1","a quick","10"
"2","a quick "brown" fox","20"
"3","quick, brown fox","30"

Note that there's spurious quotes around "brown" in the 2nd row. This content goes into a file called "my_data.csv".

When I try to read that file, I get some parsing failures.

> library(tidyverse)
> df <- read_csv("./my_data.csv", col_types = cols(.default = "c"))
Warning: 2 parsing failures.
row # A tibble: 2 x 5 col     row   col           expected actual            file expected   <int> <chr>              <chr>  <chr>           <chr> actual 1     2 col_b delimiter or quote      b './my_data.csv' file 2     2 col_b delimiter or quote        './my_data.csv'

As you can see, the parsing failure has not been "pretty printed". It is ONE LONG LINE of 271 characters.

I can't figure out where to even put linebreaks in the failure message to see where the problem is and what the message is trying to tell me. Moreover, it refers to a "2x5 tibble". What tibble? My data frame is 3x3.

Can someone show me how to format or put linebreaks in the message from read_csv() so I can see how it is detecting the problem?

Yes, I know what the problem is in this particular minimal example. In my actual data I am dealing with large amounts of csv (~1M rows), peppered with inconsistencies that shower me with hundreds of parsing failures. I'd like to setup a workflow for categorizing these and dealing with them programmatically. The first step, I think, is just understanding how to "parse" the parsing failure message.


Solution

  • After taking a breath and looking at the actual documentation, I see there is a way to get the parsing failures from read_csv() in a form that is very usable.

    All I had to do to get the parsing failures was to use problems().

    > library(tidyverse)
    > df <- read_csv("./my_data.csv", col_types = cols(.default = "c"))
    Warning: 2 parsing failures.
    row # A tibble: 2 x 5 col     row   col           expected actual            file expected   <int> <chr>              <chr>  <chr>           <chr> actual 1     2 col_b delimiter or quote      b './my_data.csv' file 2     2 col_b delimiter or quote        './my_data.csv'
    
    > parsing_failures <- problems(df)
    > parsing_failures
    # A tibble: 2 x 5
        row   col           expected actual            file
      <int> <chr>              <chr>  <chr>           <chr>
    1     2 col_b delimiter or quote      b './my_data.csv'
    2     2 col_b delimiter or quote        './my_data.csv'
    

    Apparently read_csv() associates a tibble containing parsing failure details and this is accessible by passing the result from read_csv to problems().