Search code examples
rcsvduckdb

Quote should be followed end of value, end of row or another quote?


I'm trying to read a 20M records tsv in R, but keep getting errors because aparently the file was not generated correctly.

However, the data is always uploaded to Adobe's servers, for a product we use. It gives warning about values not been correctly formated, but nonetheless it gets uploaded and we can use the data.

1.- How can I upload the data to duckdb regardless of the formating issues?

2.- Error in file "xxx,csv" on line 1107386. How can I access only that line (or maybe 5 before and after) to see what is really the problem?

enter image description here


Solution

  • EDIT: long story short, duckdb_read_csv needs the quote="" just like read.csv does in the previous answer (preserved, below). You may still need to consider some of the reparative steps for removing other quotes that may now be retained by this.

    Reprex:

    duck <- DBI::dbConnect(duckdb::duckdb())
    writeLines(c('a,b,d', '1,"Boticas MiCasa" RSL,3'), "fake.csv")
    duckdb::duckdb_read_csv(duck, "faketable", "fake.csv")
    # Error: rapi_execute: Failed to run query
    # Error: Invalid Input Error: Error in file "/home/r2/fake.csv" on line 2: quote should be followed by end of value, end of row or another quote. (  file=/home/r2/fake.csv
    #   delimiter=','
    #   quote='"'
    #   escape='"' (default)
    #   header=1
    #   sample_size=20480
    #   ignore_errors=0
    #   all_varchar=0). 
    duckdb::duckdb_read_csv(duck, "faketable", "fake.csv", quote="")
    DBI::dbGetQuery(duck, "select * from faketable")
    #   a                    b d
    # 1 1 "Boticas MiCasa" RSL 3
    

    Just a thought to help troubleshoot, not sure it'll work.

    When the error is about quotes, you can turn off quote process and try again. It will result in having quotes all throughout your data, but (1) these are easy to remove, and (2) at least you can see where the problem is.

    Unfortunately ... if the data has embedded commas within strings (which would make the use of quotes required), then this won't work.

    Demo:

    # normal
    read.csv(text='a,b,d\n"1",2,3\n"4",5,6')
    #   a b d
    # 1 1 2 3
    # 2 4 5 6
    read.csv(text='a,b,d\n"1",2,3\n"4,5,6')
    # Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
    #   incomplete final line found by readTableHeader on 'text'
    

    We can disable processing of quotes by adding quote="", and see the problematic row:

    read.csv(text='a,b,d\n"1",2,3\n"4,5,6', quote="")
    #     a b d
    # 1 "1" 2 3
    # 2  "4 5 6
    

    We can clean up both the good ("1") and bad ("4) strings with this pipeline:

    read.csv(text='a,b,d\n"1",2,3\n"4,5,6', quote="") |>
      lapply(function(z) if (is.character(z)) type.convert(gsub(r"(^['"]|['"]$)", "", z), as.is=TRUE) else z) |>
      as.data.frame()
    #   a b d
    # 1 1 2 3
    # 2 4 5 6
    

    If you have embedded quotes (requiring CSV's double-quoting escape, as in "hello ""world"" again", then you'd need another step in the cleanup to work on those middle quotes.