The code below should filter out JSON fields with invalid names and lowercase the names of the remaining fields. It behaves as expected on small tests, but on an 8.6M-item sequence of live data the output sequence is significantly longer than the input sequence:
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark._

val txt = sc.textFile("s3n://...")
val patt = """^[a-zA-Z]\w*$""".r.findFirstIn _

val json = for {
  line <- txt
  JObject(children) <- parse(line)
  children2 = for {
    JField(name, value) <- children
    // filter fields with invalid names;
    // patt(name) returns Option[String]
    _ <- patt(name)
  } yield JField(name.toLowerCase, value)
} yield compact(render(JObject(children2)))
I have checked that the output actually contains more unique items than the input, so it is not just duplicating items. Given my understanding of Scala for-comprehensions and json4s, I do not see how this is possible. The large live data collection is a Spark RDD, while my tests used an ordinary Scala Seq, but that should not make any difference.
How can `json` have more elements than `txt` in the above code?
I was not aware that `JObject(children) <- parse(line)` matches recursively inside the result of `parse`. So even though `parse` returns a single value, when there are nested objects they are returned as separate bindings for `children`. (json4s equips `JValue` with monadic `map`/`flatMap`/`withFilter` methods that traverse the whole JSON tree, which is why a `<-` generator visits every nested object rather than just the top-level one.)
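The effect can be reproduced without json4s. A minimal pure-Scala sketch (the `Tree`/`Node`/`Leaf` names are made up for illustration): enumerating every subtree and then pattern-matching in a generator binds once per nested match, analogous to how json4s' monadic `JValue` traverses nested objects.

```scala
// Hypothetical stand-in for json4s' recursive JValue traversal:
// `all` enumerates this node and every descendant, so a pattern
// generator over it can match more than once per input tree.
sealed trait Tree {
  def all: List[Tree] = this match {
    case Node(kids) => this :: kids.flatMap(_.all)
    case Leaf(_)    => List(this)
  }
}
case class Node(kids: List[Tree]) extends Tree
case class Leaf(v: String) extends Tree

val t: Tree = Node(List(Node(List(Leaf("x")))))

// One input value, but the pattern generator matches every nested Node:
val nodes = for { n @ Node(_) <- t.all } yield n
println(nodes.size) // 2: the outer node and the nested one
```

One input tree yields two bindings, which is exactly how one JSON line with nested objects can yield several output elements.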
The answer is to use `JObject(children) = parse(line)` instead, i.e. a val-pattern binding rather than a generator. The correct code is:
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark._

val txt = sc.textFile("s3n://...")
val patt = """^[a-zA-Z]\w*$""".r.findFirstIn _

val json = for {
  line <- txt
  JObject(children) = parse(line) // CHANGED <- TO =
  children2 = for {
    JField(name, value) <- children
    // filter fields with invalid names;
    // patt(name) returns Option[String]
    _ <- patt(name)
  } yield JField(name.toLowerCase, value)
} yield compact(render(JObject(children2)))
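For contrast, a `=` inside a for-comprehension is a plain val-pattern: it is evaluated and bound exactly once per iteration of the enclosing generator, so the output stays one-to-one with the input. A pure-Scala sketch (the tuple data is made up for illustration); note that a val-pattern that fails to match throws a MatchError, so lines that do not parse to a JSON object would need separate handling.

```scala
// A `=` binding in a for-comprehension binds once per iteration;
// it never multiplies results the way a `<-` generator can.
val lines = List(("a", 1), ("b", 2))
val out = for {
  line <- lines
  (name, value) = line // val-pattern: exactly one binding per line
} yield name.toUpperCase
println(out) // List(A, B)
```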