Strange behaviour of Scala for-comprehension and json4s


The code below should:

  • iterate over a sequence of strings
  • parse each one as JSON
  • filter out fields whose names could not be used as an identifier in most languages
  • lowercase the remaining names
  • serialize the result as a string

It behaves as expected on small tests, but on a sequence of 8.6M items of live data the output sequence is significantly longer than the input sequence:

import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark._

val txt = sc.textFile("s3n://...")
val patt = """^[a-zA-Z]\w*$""".r.findFirstIn _
val json = (for {
  line <- txt
  JObject(children) <- parse(line)
  children2 = (for {
    JField(name, value) <- children

    // filter fields with invalid names
    // patt(name) returns Option[String]
    _ <- patt(name)

  } yield JField(name.toLowerCase, value))
} yield compact(render(JObject(children2))))

I have checked that the number of unique items also increases, so it is not just duplicating items. Given my understanding of Scala for-comprehensions and json4s, I do not see how this is possible. The large live data collection is a Spark RDD, while my tests used an ordinary Scala Seq, but that should not make any difference.

How can json have more elements than txt in the above code?


Solution

  • I was not aware that

    JObject(children) <- parse(line)
    

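    matches recursively inside the result of parse. Even though parse returns a single value, the generator form tries the pattern against every node of the parsed tree, so each nested object is returned as a separate binding for children. The answer is to use a value definition, which matches only the top-level value: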

    JObject(children) = parse(line)
    

    The corrected code is:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._
    import org.apache.spark._

    val txt = sc.textFile("s3n://...")
    val patt = """^[a-zA-Z]\w*$""".r.findFirstIn _
    val json = (for {
      line <- txt
      JObject(children) = parse(line) // changed <- to =
      children2 = (for {
        JField(name, value) <- children

        // filter fields with invalid names
        // patt(name) returns Option[String]
        _ <- patt(name)

      } yield JField(name.toLowerCase, value))
    } yield compact(render(JObject(children2))))
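
To see the difference in isolation, here is a minimal sketch that runs on an ordinary Seq instead of an RDD. The sample line and the names NestedMatchDemo, viaGenerator and viaDefinition are made up for illustration, and the sketch assumes json4s-jackson is on the classpath.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

object NestedMatchDemo {
  // Hypothetical sample input: one line whose top-level object contains a nested object.
  val lines: Seq[String] = Seq("""{"a": 1, "b": {"c": 2}}""")

  // Generator form (<-): json4s filters over the whole parsed tree,
  // so the pattern also matches the nested object under "b".
  val viaGenerator: Seq[List[String]] = for {
    line <- lines
    JObject(children) <- parse(line)
  } yield children.map { case JField(name, _) => name }

  // Value-definition form (=): the pattern is applied once, to the top-level value only.
  val viaDefinition: Seq[List[String]] = for {
    line <- lines
    JObject(children) = parse(line)
  } yield children.map { case JField(name, _) => name }

  def main(args: Array[String]): Unit = {
    println(viaGenerator.size)  // 2: the top-level object plus the nested one
    println(viaDefinition.size) // 1: one result per input line
  }
}
```

One consequence of the = form worth keeping in mind: as far as I can tell, a line whose top-level JSON value is not an object (a bare array, for example) is no longer silently skipped but fails with a scala.MatchError, so input that may contain such lines would need an explicit match on the result of parse.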