Search code examples
rcsvreadr

csv file not read correctly (almost half of the rows are removed)


I want to read in a csv file with R. It has a semicolon as the separator and uses double quotes as quotes.

However, the file has some problematic cells (e.g. some have one double quote in text cells or an semicolon in the text) which leads to R not reading the file in correctly.

Since I can't upload a file, here's how the csv looks like when opened with a text editor:

"X1";"X2";"X3"
"1";"0";"test"
"37129";"11746; weird";"weird "thing""
"27929";"1";"xyz Limited"

Row 1 contain the headlines. In row 3 you can see that the X2 value has a semicolon after "11746" and the X3 value has an additional double quote before "thing".

So when reading this file in with readr::read_csv2(file = my_file.csv) in this small example it looks ok, but it actually isn't (note that my real file has 13k rows). For example, we'd expect to see double quotes before and after "thing".

When I use readr::read_csv2(file = my_file.csv, quote = "") instead, row 2's values are shifted due to the semicolon.

Any idea, how I could solve this issue?

The expected output would be:

# A tibble: 3 × 3
     X1 X2           X3             
  <dbl> <chr>        <chr>          
1     1 0            "test"         
2 37129 11746; weird "\"weird thing\""
3 27929 1            "xyz Limited"

Solution

  • That is not standards-compliant semi-colon-CSV. The "thing" quotes should be doubled to be escaped, as in "37129";"11746; weird";"weird ""thing""". Whatever is creating that file is not adhering to the standard.

    We can try to use regex to parse the contents, changing the embedded " quotes to double "" (and not changing the enclosing "s), perhaps this:

    ## on the shell, not in R
    $ sed -E 's/([^;])"([^;])/\1""\2/g' quux.csv > newquux.csv
    
    ## in R
    read.csv2("newquux.csv")
    #      X1           X2            X3
    # 1     1            0          test
    # 2 37129 11746; weird weird "thing"
    # 3 27929            1   xyz Limited
    

    One might be tempted to do this in R instead, which is generally fine. However, if the csv file is huge, then this will cause an unnecessary growth in the R's memory consumption ... again, only an issue if your file is huge.

    read.csv2(
      text = gsub('(?<=[^;])"(?=[^;])', '""',
                  readLines("quux.csv"), perl = TRUE))
    #      X1           X2            X3
    # 1     1            0          test
    # 2 37129 11746; weird weird "thing"
    # 3 27929            1   xyz Limited
    

    Both of the above work just as well with readr::read_csv2:

    readr::read_csv2("newquux.csv")
    readr::read_csv2(
      I(gsub('(?<=[^;])"(?=[^;])', '""',
             readLines("quux.csv"), perl = TRUE)))
    

    (The I(..) is necessary here since otherwise it opts to look for a file named "X1";"X2";"X3".)