java csv parsing univocity

Univocity Parser: TextParsingException while parsing a line that has an opening double quote (") but no closing double quote (")


I get the following exception while parsing a file:

com.univocity.parsers.common.TextParsingException: Length of parsed input (4097) exceeds the maximum number of characters defined in your parser settings (4096). 
Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\r\n'. Parsed content: The quick brown fox jumps over the lazy dog.|[\n]

File Content:

1234|5678|The quick brown fox jumps over the lazy dog.|
1234|5678|"The quick brown fox jumps over the lazy dog.|
1234|5678|The quick brown fox jumps over the lazy dog.|
1234|5678|The quick brown fox jumps over the lazy dog.|
1234|5678|The quick brown fox jumps over the lazy dog.|
.........
.........
1234|5678|The quick brown fox jumps over the lazy dog.|

I'm using the following CSV Parser settings:

CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setLineSeparatorDetectionEnabled(true);
parserSettings.getFormat().setDelimiter('|');
parserSettings.setIgnoreLeadingWhitespaces(true);
parserSettings.setIgnoreTrailingWhitespaces(true);
parserSettings.setHeaderExtractionEnabled(false);
parserSettings.setMaxCharsPerColumn(4096);

What I can infer from the exception is that the second line has an opening double quote (") but does not end with a closing double quote ("). As a result, the column value runs until EOF (end of file) and exceeds the 4096-character limit.

Tested with build: 2.2.2


Solution

  • That's how a CSV parser is supposed to work. An opening quote signals that the content after it may contain delimiters, line endings, or other (hopefully escaped) quotes, so the parser keeps reading until it finds the closing quote.

    The only way to work around this in your case is to disable quote handling, with something like this:

    parserSettings.getFormat().setQuote('\0');
    

    This makes the parser ignore quotes and process values that contain them as ordinary unquoted values, so the quote character stays in the value as literal text. As soon as a delimiter or line ending is found, the value is collected as you expect.
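
To illustrate why an unclosed quote swallows the rest of the file, here is a deliberately simplified model of quote handling. It is a sketch, not univocity's actual implementation: the class `QuoteModel` and its `parse` method are hypothetical names, escaping and `\r\n` handling are left out, and the only thing shown is the switch between quoted and unquoted mode. Passing `'\0'` as the quote character mimics `setQuote('\0')`, assuming the input never contains a NUL character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy parser, for illustration only (not part of univocity).
class QuoteModel {

    // Splits the input into rows of fields. While inside quotes, delimiters
    // and line endings are treated as ordinary content of the current field.
    static List<List<String>> parse(String input, char delimiter, char quote) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes;          // opening or closing quote
            } else if (inQuotes) {
                field.append(c);               // '|' and '\n' are just content here
            } else if (c == delimiter) {
                row.add(field.toString());     // end of field
                field.setLength(0);
            } else if (c == '\n') {
                row.add(field.toString());     // end of field and end of row
                field.setLength(0);
                rows.add(row);
                row = new ArrayList<>();
            } else {
                field.append(c);
            }
        }
        // Flush whatever is left when the input ends (e.g. an unclosed quote).
        if (field.length() > 0 || !row.isEmpty()) {
            row.add(field.toString());
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String input = "a|b|c|\na|b|\"c|\na|b|c|\n";
        // With '"' as the quote, row 2 absorbs everything up to the end of input.
        System.out.println("with quote: " + parse(input, '|', '"').size() + " rows");
        // With '\0' as the quote, '"' is plain text and every line is its own row.
        System.out.println("quote disabled: " + parse(input, '|', '\0').size() + " rows");
    }
}
```

In this model the three-line input yields two rows when `"` is the quote character, because the last field of the second row keeps growing until the input ends. With the quote set to `'\0'` it yields three rows, and the stray `"` remains part of the third field of the second row. A real parser enforces a limit such as `setMaxCharsPerColumn(4096)` on that growing field, which is where the exception in the question comes from.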