
OpenCSV infinite loop when line has missing quote


I am reading in very large (millions of rows) CSV files that come from a remote source which I have no control over. I am using OpenCSV, which was working great until today. Today's file has a single bad line in it that looks something like

col1,col2,col3,"col4, ""stuff"" and yeah, \", col5, col6, col7...\r\n

The extra \ on the end is breaking OpenCSV so that readNext never returns. I suspect it is seeing that as an escaped quote, and thereby the quoted field is never closed. If I remove the \ all is good. Put it back, and it breaks again.

Since readNext never returns, I do not have a good way to capture the error and intercept it.

My guess is that it is trying to load the entire rest of the file (hundreds of thousands of rows) into col4 and choking.
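The suspicion above can be illustrated with a small quote-state scanner. This is only a sketch of the general idea, not OpenCSV's actual code: `QuoteStateSketch` and `endsInsideQuotes` are hypothetical names, and the sketch assumes `"` as the quote character and `\` as the escape character. It reports whether a line ends while a quoted field is still open, which is the state in which a multi-line reader would keep pulling in more lines.

```java
// Hypothetical helper for illustration only; this is not how OpenCSV is implemented.
class QuoteStateSketch {

    /** Returns true if the scan is still inside a quoted field when the line ends. */
    static boolean endsInsideQuotes(String line, char quote, char escape) {
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            boolean hasNext = i + 1 < line.length();
            if (c == escape && inQuotes && hasNext) {
                char next = line.charAt(i + 1);
                if (next == quote || next == escape) {
                    i++; // escaped character: skip it so it cannot close the field
                    continue;
                }
            }
            if (c == quote) {
                if (inQuotes && hasNext && line.charAt(i + 1) == quote) {
                    i++; // doubled quote inside a quoted field is a literal quote
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            }
        }
        return inQuotes;
    }

    public static void main(String[] args) {
        // The bad line from the question (with the backslash) and the same line without it.
        String bad = "col1,col2,col3,\"col4, \"\"stuff\"\" and yeah, \\\", col5, col6, col7";
        String good = "col1,col2,col3,\"col4, \"\"stuff\"\" and yeah, \", col5, col6, col7";
        System.out.println("bad line ends inside quotes:  " + endsInsideQuotes(bad, '"', '\\'));
        System.out.println("good line ends inside quotes: " + endsInsideQuotes(good, '"', '\\'));
    }
}
```

A check like this could be run on each raw line before handing it to a parser, as a cheap way to flag suspicious records.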

What I would prefer is an error that I can catch, report, and move on to the next line in the file. Any idea how I can accomplish this?


Solution

  • OK - I figured out a way. Originally I was using:

    reader = new CSVReader(new FileReader(this.fullFileName), ',','"', 1);
    

    Then had a loop like so:

    while ((csvLine = reader.readNext()) != null) {
    ..do stuff..
    }
    

    That call to readNext() never returns when it hits that bad record, so there is no way to catch the error. I changed the code to use CSVParser instead:

    // Read all lines up front, then parse each line on its own.
    fileLines = Files.readAllLines(new File(this.fullFileName).toPath(), Charset.forName("UTF-8"));
    CSVParser csvParser = new CSVParser(delimChar, quoteChar);
    for (String nextLine : fileLines) {
        try {
            csvLine = csvParser.parseLine(nextLine);
            // ...do stuff...
        } catch (Exception ex) {
            // ...report bad record and stuff...
        }
    }
    

    Now, when that record is hit, CSVParser will throw an exception, which I can catch and do stuff with.

    The primary drawback to this is that multi-line records will not work, but in my use case that is not a problem. I do not know a solution for multi-line records.
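Since `Files.readAllLines` holds the whole file in memory, a streaming variant of the loop above may be worth considering for files with millions of rows. The following is a minimal sketch under stated assumptions: `LineByLineSketch`, `LineParser`, `parseAll`, and `toyParse` are hypothetical names introduced here, and `toyParse` is only a stand-in that rejects lines with an odd number of quotes so the example runs without OpenCSV on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch only: these names are hypothetical and not part of OpenCSV.
class LineByLineSketch {

    /** Stand-in for a per-line parser such as the parseLine call used above. */
    interface LineParser {
        String[] parse(String line) throws IOException;
    }

    /**
     * Reads one line at a time, passes each parsed record to onRecord,
     * and returns the 1-based line numbers of records that failed to parse.
     */
    static List<Integer> parseAll(BufferedReader reader, LineParser parser,
                                  Consumer<String[]> onRecord) throws IOException {
        List<Integer> badLines = new ArrayList<>();
        String line;
        int lineNo = 0;
        while ((line = reader.readLine()) != null) {
            lineNo++;
            try {
                onRecord.accept(parser.parse(line));
            } catch (IOException ex) {
                badLines.add(lineNo); // report the bad record and move on
            }
        }
        return badLines;
    }

    /** Toy parser for the demo: fails on an odd number of quote characters. */
    static String[] toyParse(String line) throws IOException {
        long quotes = line.chars().filter(ch -> ch == '"').count();
        if (quotes % 2 != 0) {
            throw new IOException("Unterminated quoted field: " + line);
        }
        return line.split(",", -1);
    }

    public static void main(String[] args) throws IOException {
        String data = "a,b,c\nx,\"broken,z\n1,2,3\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(data))) {
            List<Integer> bad = parseAll(reader, LineByLineSketch::toyParse,
                    record -> System.out.println(String.join("|", record)));
            System.out.println("Bad lines: " + bad);
        }
    }
}
```

With OpenCSV available, `csvParser::parseLine` could presumably be passed in place of `toyParse`, and the reader could come from `Files.newBufferedReader(path, StandardCharsets.UTF_8)`. Like the loop above, this still treats every physical line as one record, so multi-line records remain unsupported.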