Tags: apache-spark, parsing, apache-spark-sql, univocity

Getting com.univocity.parsers.common.TextParsingException while loading a csv file


I'm trying to join a TSV dataset that has a lot of embedded newlines in its data to another dataframe, and I keep getting

com.univocity.parsers.common.TextParsingException

I've already cleaned the data, replacing \N with NA, as I thought that could be the cause, but with no success.

The error points me to the following record in the faulty data:

tt0100054 2 Повелитель мух SUHH ru NA NA 0

The stack trace is as follows:

    19/03/02 17:45:42 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 10)
com.univocity.parsers.common.TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000). 
Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:
    Sesso e il poliziotto sposato   IT  NA  NA  NA  0[\n]
    tt0097089   4   Sex and the Married Detective   US  NA  NA  NA  0[\n]`tt0100054 1   Fluenes herre   NO  NA  imdbDisplay NA  0
tt0100054   20  Kärpästen herra FI  NA  NA  NA  0
tt0100054   2
    at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:302)
    at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:431)
    at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:148)
    at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:131)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1000000
    at com.univocity.parsers.common.input.AbstractCharInputReader.appendUtilAnyEscape(AbstractCharInputReader.java:331)
    at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:246)
    at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:119)
    at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:400)
    ... 22 more

I've already tried setting .option("maxCharsPerCol", "110000000") and .option("multiLine", "true") on the CSV reader, but neither helps. I'd appreciate any help fixing this.

I'm using Spark 2.0.2 and Scala 2.11.8.
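
For context, the load looks roughly like this (the file path and the header option are placeholders, not my exact code):

    val df = spark.read                        // spark: an existing SparkSession
      .option("sep", "\t")                     // tab-separated input
      .option("header", "true")                // placeholder: assumes a header row
      .option("maxCharsPerCol", "110000000")   // option I tried, as above
      .option("multiLine", "true")             // option I tried, as above
      .csv("/path/to/data.tsv")                // placeholder path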


Solution

  • Author of univocity-parsers here.

    The parser was built to fail fast when something is potentially wrong with either your program (e.g. the file format was not configured correctly) or the input file (e.g. the input doesn't have the format your program expects, or has unescaped/unclosed quotes).

    The stack trace shows this:

    Sesso e il poliziotto sposato   IT  NA  NA  NA  0[\n]
    tt0097089   4   Sex and the Married Detective   US  NA  NA  NA  0[\n]`tt0100054 1   Fluenes herre   NO  NA  imdbDisplay NA  0
    tt0100054   20  Kärpästen herra FI  NA  NA  NA  0
    tt0100054   2
    

    Which clearly shows the content of multiple rows being read as if they were part of a single value. This means that somewhere around this text in your input file there is a value that starts with a quote that is never closed.
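    To make that concrete, here is a fabricated two-line illustration (not your actual data). The stray quote before Fluenes herre opens a quoted value that is never closed, so the parser keeps accumulating everything that follows, line breaks included, into that one value until it finds a closing quote or hits the character limit:

        tt0100054   1   "Fluenes herre  NO  NA  imdbDisplay NA  0
        tt0100054   20  Kärpästen herra FI  NA  NA  NA  0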

    You can configure the parser not to attempt to handle quoted values at all, with this:

    settings.getFormat().setQuote('\0');
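
    Since Spark's CSV reader is backed by univocity, the rough equivalent from the Spark side is to pass the NUL character through the quote option. A minimal sketch, with a placeholder path:

        // Sketch: disable quote processing from the Spark side.
        // spark is an existing SparkSession; the path is a placeholder.
        val df = spark.read
          .option("sep", "\t")          // treat the file as tab-separated
          .option("quote", "\u0000")    // same effect as setQuote('\0')
          .csv("/path/to/data.tsv")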
    

    If you are sure your format configuration is correct and that there are very long values in the input, set maxCharsPerColumn to -1.
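    From Spark, the same limit is exposed as the maxCharsPerColumn option. Note the spelling: as far as I can tell, the maxCharsPerCol variant from the question is not a recognized option key and Spark silently ignores unknown keys, which would explain why raising it had no effect. A minimal sketch:

        // Sketch: lift the 1,000,000-character default limit from the Spark side.
        val df = spark.read
          .option("sep", "\t")
          .option("maxCharsPerColumn", "-1")   // -1 removes the limit
          .csv("/path/to/data.tsv")            // placeholder path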

    Lastly, it looks like you are parsing TSV, which is not CSV and should be processed differently. If that's the case, you can also try using the TsvParser instead, as sketched below.
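    A minimal sketch of doing that with univocity directly (the file path is a placeholder):

        import com.univocity.parsers.tsv.{TsvParser, TsvParserSettings}
        import java.io.FileReader

        val settings = new TsvParserSettings()   // tab delimiter, no quote handling
        settings.setMaxCharsPerColumn(-1)        // no per-value length limit
        val parser = new TsvParser(settings)
        val rows = parser.parseAll(new FileReader("/path/to/data.tsv"))

    From Spark itself there is no dedicated TSV source, so the closest equivalent there is the CSV reader with the tab delimiter shown in the sketches above.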

    Hope this helps.