Getting NPE on simple Regex Replacing (Scala on Spark)

I wrote some simple code to parse a large XML file (extract lines, clean text, and remove any HTML tags from it) using Apache Spark.

I'm seeing a NullPointerException when calling .replaceAllIn on a string that is non-null.

The funny thing is that I have no errors when I run the code locally, using input from disk, but I get a NullPointerException when I run the same code on AWS EMR, loading the input file from S3.

Here is the relevant code:

val HTML_TAGS_PATTERN = """<[^>]+>""".r

// other code here...

.textFile(pathToInputFile, numPartitions)
.filter { str => str.startsWith("  <row ") }
.map { str =>

  Locale.setDefault(new Locale("en", "US"))

  val parts = str.split(""""""")

  var title: String = ""
  var body: String = ""

  // some code omitted here

  title = StringEscapeUtils.unescapeXml(title).toLowerCase.trim
  body = StringEscapeUtils.unescapeXml(body).toLowerCase // decode xml entities

  println("before replacing, body is: "+body)

  body = HTML_TAGS_PATTERN.replaceAllIn(body, " ") // take out htmltags


Things I've tried:

  • printing the string just before calling replaceAllIn to make sure it's not null.

  • making sure the Locale is not null

  • printing out the exception message and stack trace: they just tell me that this line is where the NullPointerException occurs, nothing more.
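
A NullPointerException reported on the replaceAllIn line can come from either reference involved: the string being cleaned or the regex object itself. The checks listed above only cover the string. A minimal sketch that tells the two cases apart might look like this (the object name NullChecks and the helper stripTags are hypothetical, not part of the original code):

```scala
import scala.util.matching.Regex

object NullChecks {
  // Hypothetical helper: fails with a descriptive message instead of a bare NPE,
  // so the log shows which of the two references was null.
  def stripTags(pattern: Regex, body: String): String = {
    require(pattern != null, "regex is null (was it initialised on this JVM?)")
    require(body != null, "body is null")
    pattern.replaceAllIn(body, " ")
  }
}
```

Calling NullChecks.stripTags(HTML_TAGS_PATTERN, body) inside the map body would then raise an IllegalArgumentException naming the null reference, which is easier to read in executor logs than the bare NPE.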

Things that are different between my local setup and AWS EMR:

  • in my local setup, I load the input file from disk; on EMR I load it from S3.

  • in my local setup, I run Spark in standalone mode; on EMR it runs in cluster mode.

Everything else is the same on my machine and on AWS EMR: Scala version, Spark version, Java version, Cluster configs...

I have been trying to figure this out for some hours and I can't think of anything else to try.


I've moved the call to .r inside the map { } body, like this:

val HTML_TAGS_PATTERN = """<[^>]+>"""

// code omitted


   body = HTML_TAGS_PATTERN.r.replaceAllIn(body, " ")    


This also produces an NPE, with the following stack trace:

    at java.util.regex.Pattern.<init>(
    at java.util.regex.Pattern.compile(
    at scala.util.matching.Regex.<init>(Regex.scala:191)
    at scala.collection.immutable.StringLike$class.r(StringLike.scala:255)
    at scala.collection.immutable.StringOps.r(StringOps.scala:29)
    at scala.collection.immutable.StringLike$class.r(StringLike.scala:244)
    at scala.collection.immutable.StringOps.r(StringOps.scala:29)
    at ReadSOStanfordTokenize$$anonfun$2.apply(ReadSOStanfordTokenize.scala:102)
    at ReadSOStanfordTokenize$$anonfun$2.apply(ReadSOStanfordTokenize.scala:72)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spar


  • I think you should try putting the regex inline, like below.

    { _ =>
       body = """<[^>]+>""".r.replaceAllIn(body, " ")

    This is a bit of a lame solution; you should be able to define a constant, maybe put it in a global object or something. I'm not sure where you are defining it such that it would be a problem. But remember that Spark serialises the code and runs it on distributed workers, so something could be going wrong with that.

    I get a very similar error when I run .r on a null String.

    val x: String = null
    x.r // throws java.lang.NullPointerException

    That error has slightly different line numbers, I think because of the Scala version. I'm on 2.12.2.
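
    To make the "global object" idea concrete, here is a minimal sketch. It rests on an assumption the question doesn't confirm: that the val arrives as null on the workers because the enclosing object's fields were never initialised there (one known way this can happen is a driver object that extends scala.App, whose body only runs via main). The names HtmlPatterns and Tags are hypothetical.

```scala
import scala.util.matching.Regex

// Hypothetical holder object. A top-level object is initialised on first access
// in each JVM, so a closure that refers to it does not depend on driver-side state.
object HtmlPatterns {
  // lazy val: the pattern is compiled the first time it is used in a given JVM.
  lazy val Tags: Regex = """<[^>]+>""".r
}

object StripTagsDemo {
  // Plain Scala collection here; the same function body could be used inside an RDD map.
  def clean(lines: Seq[String]): Seq[String] =
    lines.map(line => HtmlPatterns.Tags.replaceAllIn(line, " "))
}
```

    If this works on EMR where the original val did not, that would point at initialisation of the enclosing object rather than at the regex or the input data.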