I am trying to parse a CSV file with the univocity CSV parser, using the following options:
HEADER -> true
DELIMITERS -> ,
MULTILINE -> true
DEFAULT_TIME_STAMP -> yyyy/MM/dd HH:mm:ss ZZ
IGNORE_TRAILING_WHITE_SPACE -> false
IGNORE_LEADING_WHITE_SPACE -> false
TIME_ZONE -> Asia/Kolkata
COLUMN_PRUNING -> true
ESCAPE -> "\""
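For context, `readerOptions` in the code below is the map built from these options. A rough sketch of it (the key names here simply mirror the list above; the layer that builds this map may use different keys, e.g. Spark's own CSV reader expects lowerCamelCase keys such as `delimiter`, `multiLine` and `timestampFormat`; TIME_ZONE and COLUMN_PRUNING are passed to CSVOptions separately):

```scala
// Sketch only: key names copied from the option list above; the wrapper that
// actually builds readerOptions may use different key names.
val readerOptions: Map[String, String] = Map(
  "HEADER" -> "true",
  "DELIMITERS" -> ",",
  "MULTILINE" -> "true",
  "DEFAULT_TIME_STAMP" -> "yyyy/MM/dd HH:mm:ss ZZ",
  "IGNORE_TRAILING_WHITE_SPACE" -> "false",
  "IGNORE_LEADING_WHITE_SPACE" -> "false",
  "ESCAPE" -> "\""
)
```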
val csvOptionsObject = new CSVOptions(readerOptions, COLUMN_PRUNING, TIME_ZONE)
val parserInstance = csvOptionsObject.asParserSettings
parserInstance.setLineSeparatorDetectionEnabled(true)
val parserObject = new CsvParser(parserInstance)
val readerStream = parserObject.beginParsing(dataObj.getInputStream, csvOptionsObject.charset)
val row = parserObject.parseNext()
The file has 30 columns, but when I parse it, it shows up as 2302 rows.
The file uses \r as the line separator (I can see this while parsing).
Setting the line separator explicitly to \r resolves this issue.
But I will also have files with \n as the separator, and there the \n gets replaced with \r.
Calling setNormalizedLineEndingWithinQuotes(false) resolves that, but it then fails with other files where all the values are quoted (again it fails to detect the separator).
Is there any possible workaround?
Using asParserSettings from Spark's CSVOptions changed the inputBufferSize to 128 (the default value in Spark), whereas the default value in the univocity CSV parser is 1048576.
Adding this code solved the problem:
val csvOptionsObject = new CSVOptions(readerOptions, COLUMN_PRUNING, TIME_ZONE)
val parserInstance = csvOptionsObject.asParserSettings
parserInstance.setLineSeparatorDetectionEnabled(true)
parserInstance.setInputBufferSize(1048576) // Setting it to match univocity parser's default value
val parserObject = new CsvParser(parserInstance)
val readerStream = parserObject.beginParsing(dataObj.getInputStream, csvOptionsObject.charset)
val row = parserObject.parseNext()
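The fix can also be reproduced with univocity alone, without Spark. A minimal sketch (assuming only the univocity-parsers library on the classpath; the object name and sample data are illustrative):

```scala
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
import java.io.StringReader

object LineSeparatorDemo {
  // Parse CSV text with automatic line-separator detection enabled and
  // univocity's own default input buffer size, returning the row count.
  def countRows(data: String): Int = {
    val settings = new CsvParserSettings()
    settings.setLineSeparatorDetectionEnabled(true)
    // Spark's asParserSettings shrinks this to 128; restore univocity's default
    settings.setInputBufferSize(1048576)
    val parser = new CsvParser(settings)
    parser.parseAll(new StringReader(data)).size()
  }

  def main(args: Array[String]): Unit = {
    // \r-separated input, as in the problem file
    println(countRows("a,b,c\r1,2,3\r4,5,6"))
    // \n-separated input should parse to the same number of rows
    println(countRows("a,b,c\n1,2,3\n4,5,6"))
  }
}
```

With detection enabled and a full-size buffer, both inputs parse to the same three rows regardless of which separator the file uses.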