I'm trying to join a TSV dataset that has a lot of embedded newlines in its data to another dataframe, and I keep getting
com.univocity.parsers.common.TextParsingException
I've already cleaned my data to replace \N with NA, as I thought that could be the reason, but with no success.
The error points me to the following record in the faulty data:
tt0100054 2 Повелитель мух SUHH ru NA NA 0
The stack trace is as follows:
19/03/02 17:45:42 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 10)
com.univocity.parsers.common.TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000).
Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:
Sesso e il poliziotto sposato IT NA NA NA 0[\n]
tt0097089 4 Sex and the Married Detective US NA NA NA 0[\n]`tt0100054 1 Fluenes herre NO NA imdbDisplay NA 0
tt0100054 20 Kärpästen herra FI NA NA NA 0
tt0100054 2
at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:302)
at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:431)
at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:148)
at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:131)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1000000
at com.univocity.parsers.common.input.AbstractCharInputReader.appendUtilAnyEscape(AbstractCharInputReader.java:331)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:246)
at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:119)
at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:400)
... 22 more
I've already tried setting the following options on the CSV reader: option("maxCharsPerCol","110000000").option("multiLine","true"), but it doesn't help. I'd appreciate any help fixing this.
I'm using Spark 2.0.2 and Scala 2.11.8.
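For reference, a sketch of how the read is set up (the file name and the header option are guesses):

val df = spark.read
  .option("sep", "\t")                    // tab-separated input
  .option("header", "true")
  .option("maxCharsPerCol", "110000000")  // note: Spark's documented option name is "maxCharsPerColumn"
  .option("multiLine", "true")            // note: multiLine was only added in Spark 2.2
  .csv("title.akas.tsv")                  // placeholder path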
Author of univocity-parsers here.
The parser was built to fail fast when something is potentially wrong with either your program (e.g. the file format was not configured correctly) or the input file (e.g. it doesn't have the format your program expects, or it has unescaped/unclosed quotes).
The stack trace shows this:
Sesso e il poliziotto sposato IT NA NA NA 0[\n]
tt0097089 4 Sex and the Married Detective US NA NA NA 0[\n]`tt0100054 1 Fluenes herre NO NA imdbDisplay NA 0
tt0100054 20 Kärpästen herra FI NA NA NA 0
tt0100054 2
This clearly shows the content of multiple rows being read as if they were part of a single value, which means that somewhere around this text in your input file there are values starting with a quote that is never closed.
You can configure the parser not to attempt handling quoted values, like this:
settings.getFormat().setQuote('\0');
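If you are going through Spark's CSV reader instead of calling the parser directly, the closest equivalent I know of is the quote option, where an empty string maps to '\u0000' and effectively disables quote handling (a sketch, assuming Spark 2.x option semantics):

val df = spark.read
  .option("sep", "\t")    // tab-delimited input
  .option("quote", "")    // empty string disables the quote character
  .csv("title.akas.tsv")  // placeholder path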
If you are sure your format configuration is correct and there really are very long values in the input, set maxCharsPerColumn to -1.
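Through Spark, that setting is exposed as the maxCharsPerColumn option (a sketch; -1 means unlimited in univocity, but I'm not sure the univocity version bundled with Spark 2.0.2 accepts it, so a large positive value is the safe fallback):

val df = spark.read
  .option("sep", "\t")
  .option("maxCharsPerColumn", "-1")  // -1 = unlimited; use a large positive value if -1 is rejected
  .csv("title.akas.tsv")              // placeholder path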
Lastly, it looks like you are parsing TSV, which is not CSV and should be processed differently. If that's the case, you can also try using the TsvParser instead.
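If you do drop down to univocity directly, a TSV parse looks roughly like this (a sketch against the univocity 2.x API; the path is a placeholder):

import com.univocity.parsers.tsv.{TsvParser, TsvParserSettings}

val settings = new TsvParserSettings()
settings.setMaxCharsPerColumn(-1)  // no per-column length limit
val parser = new TsvParser(settings)
// returns a java.util.List[Array[String]], one entry per record
val rows = parser.parseAll(new java.io.FileReader("title.akas.tsv"))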
Hope this helps.