
Spark Univocity parser - LineSeparatorDetection not working


I am trying to parse a CSV file using the univocity CSV parser with the following options.

Options:

HEADER -> true
DELIMITERS -> ,
MULTILINE -> true
DEFAULT_TIME_STAMP -> yyyy/MM/dd HH:mm:ss ZZ
IGNORE_TRAILING_WHITE_SPACE -> false
IGNORE_LEADING_WHITE_SPACE -> false
TIME_ZONE -> Asia/Kolkata
COLUMN_PRUNING -> true
ESCAPE -> "\""
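
For reference, the readerOptions map passed to CSVOptions in the code below is built roughly like this (a minimal sketch; the actual code maps the option names above onto Spark's CSV data source option keys, and the COLUMN_PRUNING / TIME_ZONE values shown here are only illustrative):

val readerOptions = Map(
  "header" -> "true",
  "delimiter" -> ",",
  "multiLine" -> "true",
  "timestampFormat" -> "yyyy/MM/dd HH:mm:ss ZZ",
  "ignoreLeadingWhiteSpace" -> "false",
  "ignoreTrailingWhiteSpace" -> "false",
  "timeZone" -> "Asia/Kolkata",
  "escape" -> "\""
)
val COLUMN_PRUNING = true          // column pruning flag passed to CSVOptions
val TIME_ZONE = "Asia/Kolkata"     // default time zone id passed to CSVOptions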

val csvOptionsObject = new CSVOptions(readerOptions, COLUMN_PRUNING, TIME_ZONE)
val parserInstance = csvOptionsObject.asParserSettings // univocity CsvParserSettings derived from Spark's CSVOptions
parserInstance.setLineSeparatorDetectionEnabled(true)  // ask univocity to auto-detect \n, \r or \r\n
val parserObject = new CsvParser(parserInstance)
val readerStream = parserObject.beginParsing(dataObj.getInputStream, csvOptionsObject.charset)
val row = parserObject.parseNext()

The file has 30 columns, but when I parse it, it shows up as 2302 rows.

The file has \r as the line separator (I can see this while parsing).

Setting the line separator explicitly to \r resolves the issue, but I will also have files that use \n as the separator, and there the \n gets replaced with \r. Setting setNormalizeLineEndingsWithinQuotes(false) resolves that, but it then fails with other files where all the values are quoted (again, the line separator is not detected).
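
For reference, those two workarounds look roughly like this (a sketch; setLineSeparator is on the CsvFormat returned by getFormat, and setNormalizeLineEndingsWithinQuotes is on CsvParserSettings):

// Workaround 1: pin the record separator explicitly (works for the \r files, not the \n ones)
parserInstance.getFormat.setLineSeparator("\r")

// Workaround 2: keep the original line endings inside quoted values
// (helps the \n files, but fails on files where every value is quoted)
parserInstance.setNormalizeLineEndingsWithinQuotes(false)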

Is there any possible workaround?


Solution

  • Spark's CSVOptions.asParserSettings changes the inputBufferSize to 128 (the default value in Spark), whereas the default value in the univocity CSV parser is 1048576. With a buffer of only 128 characters, the parser may not even see a line separator in the buffered input, so the detection can fail.

    Adding this code solved the problem:

    val csvOptionsObject = new CSVOptions(readerOptions, COLUMN_PRUNING, TIME_ZONE)
    val parserInstance = csvOptionsObject.asParserSettings
    parserInstance.setLineSeparatorDetectionEnabled(true)
    parserInstance.setInputBufferSize(1048576) // Setting it to match univocity parser's default value
    val parserObject = new CsvParser(parserInstance)
    val readerStream = parserObject.beginParsing(dataObj.getInputStream, csvOptionsObject.charset)
    val row = parserObject.parseNext()
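
    For completeness, the remaining records can then be read with the usual univocity loop (standard parser usage, not part of the fix itself):

    // continuing from the code above: parseNext() returns null once the input is exhausted
    var record = row
    while (record != null) {
      // each record is an Array[String] holding the file's 30 columns
      record = parserObject.parseNext()
    }
    parserObject.stopParsing() // optional at this point, but stops parsing and releases the input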