I wrote some simple code to parse a large XML file (extract lines, clean the text, and remove any HTML tags from it) using Apache Spark.
I'm seeing a NullPointerException when calling .replaceAllIn on a string that is non-null.
The funny thing is that I get no errors when I run the code locally, using input from disk, but I get a NullPointerException when I run the same code on AWS EMR, loading the input file from S3.
Here is the relevant code:
val HTML_TAGS_PATTERN = """<[^>]+>""".r
// other code here...
spark
  .sparkContext
  .textFile(pathToInputFile, numPartitions)
  .filter { str => str.startsWith(" <row ") }
  .toDS()
  .map { str =>
    Locale.setDefault(new Locale("en", "US"))
    val parts = str.split(""""""")
    var title: String = ""
    var body: String = ""
    // some code omitted here
    title = StringEscapeUtils.unescapeXml(title).toLowerCase.trim
    body = StringEscapeUtils.unescapeXml(body).toLowerCase // decode xml entities
    println("before replacing, body is: " + body)
    // NEXT LINE TRIGGERS NPE
    body = HTML_TAGS_PATTERN.replaceAllIn(body, " ") // take out html tags
  }
Things I've tried:
printing the string just before calling replaceAllIn, to make sure it's not null.
making sure the Locale is not null.
printing out the exception message and stack trace: it just tells me that that line is where the NullPointerException occurs, nothing more.
Things that are different between my local setup and AWS EMR:
in my local setup, I load the input file from disk; on EMR I load it from S3.
in my local setup, I run Spark in standalone mode; on EMR it runs in cluster mode.
Everything else is the same on my machine and on AWS EMR: Scala version, Spark version, Java version, cluster configs...
I have been trying to figure this out for some hours and I can't think of anything else to try.
I've moved the call to r() to within the map{} body, like this:
val HTML_TAGS_PATTERN = """<[^>]+>"""
// code omitted
.map {
  body = HTML_TAGS_PATTERN.r.replaceAllIn(body, " ")
}
This also produces an NPE, with the following stack trace:
java.lang.NullPointerException
at java.util.regex.Pattern.<init>(Pattern.java:1350)
at java.util.regex.Pattern.compile(Pattern.java:1028)
at scala.util.matching.Regex.<init>(Regex.scala:191)
at scala.collection.immutable.StringLike$class.r(StringLike.scala:255)
at scala.collection.immutable.StringOps.r(StringOps.scala:29)
at scala.collection.immutable.StringLike$class.r(StringLike.scala:244)
at scala.collection.immutable.StringOps.r(StringOps.scala:29)
at ReadSOStanfordTokenize$$anonfun$2.apply(ReadSOStanfordTokenize.scala:102)
at ReadSOStanfordTokenize$$anonfun$2.apply(ReadSOStanfordTokenize.scala:72)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spar
I think you should try putting the regex inline, like below. This is a bit of a lame solution; you should be able to define a constant, maybe by putting it in a global object or something (see the sketch after the snippet). I'm not sure where you are defining it such that it would be a problem, but remember Spark serialises the code and runs it on distributed workers, so something could be going wrong with that.
rdd.map { _ =>
  ...
  body = """<[^>]+>""".r.replaceAllIn(body, " ")
}
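If inlining it works, the "global object" idea I mentioned could look roughly like the sketch below. Take it as a rough outline rather than a drop-in fix: the object and app names are made up, and the assumption is that an object's vals are initialised when the object is first touched in each JVM, so every executor compiles its own Regex instead of relying on a val captured from the driver.

import org.apache.spark.sql.SparkSession
import scala.util.matching.Regex

// hypothetical helper object: each executor JVM initialises this val itself
object Patterns {
  val HtmlTags: Regex = """<[^>]+>""".r
}

// hypothetical driver code showing how it would be wired up
object StripTagsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("strip-tags-sketch").getOrCreate()
    spark.sparkContext
      .textFile(args(0))                                        // input path
      .map(line => Patterns.HtmlTags.replaceAllIn(line, " "))   // strip tags
      .saveAsTextFile(args(1))                                  // output path
    spark.stop()
  }
}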
I get a very similar error when I run .r on a null String.
val x: String = null
x.r
java.lang.NullPointerException
java.util.regex.Pattern.<init>(Pattern.java:1350)
java.util.regex.Pattern.compile(Pattern.java:1028)
scala.util.matching.Regex.<init>(Regex.scala:223)
scala.collection.immutable.StringLike.r(StringLike.scala:281)
scala.collection.immutable.StringLike.r$(StringLike.scala:281)
scala.collection.immutable.StringOps.r(StringOps.scala:29)
scala.collection.immutable.StringLike.r(StringLike.scala:270)
scala.collection.immutable.StringLike.r$(StringLike.scala:270)
scala.collection.immutable.StringOps.r(StringOps.scala:29)
That error has slightly different line numbers, I think because of the Scala version. I'm on 2.12.2.
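So my guess is that it's the pattern string itself, not body, that ends up null on the workers. If you want to confirm that, a cheap check is to log the constant from inside the task, something like this sketch of your existing map (nothing new here apart from the println):

.map { str =>
  // ... build body as before ...
  // if this prints true in the EMR executor logs but false locally,
  // the captured constant is what's null in cluster mode, not body
  println("pattern is null: " + (HTML_TAGS_PATTERN == null))
  HTML_TAGS_PATTERN.r.replaceAllIn(body, " ")
}

One thing worth checking, in case it applies to your setup: if the constant is defined in the body of an object that extends App, the Spark docs recommend defining a plain main() method instead, because subclasses of scala.App may not work correctly (their vals can end up uninitialised when the code runs on the cluster).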