Search code examples
scalaapache-sparkapache-spark-2.0kie

Spark: show and collect-println giving different outputs


I am using Spark 2.2

I feel like I have something odd going on here. Basic premise is that

  • I have a set of KIE/Drools rules running through a Dataset of profile objects
  • I am then trying to show/collect-print the resulting output
  • I then cast the output as a tuple to flatmap it later

Code below

implicit val mapEncoder = Encoders.kryo[java.util.HashMap[String, Any]]
implicit val recommendationEncoder = Encoders.kryo[Recommendation]
val mapper = new ObjectMapper()

val kieOuts = uberDs.map(profile => {
  val map = mapper.convertValue(profile, classOf[java.util.HashMap[String, Any]])
  val profile = Profile(map)

  // setup the kie session
  val ks = KieServices.Factory.get
  val kContainer = ks.getKieClasspathContainer
  val kSession = kContainer.newKieSession() //TODO: stateful session, how to do stateless?

  // insert profile object into kie session
  val kCmds = ks.getCommands
  val cmds = new java.util.ArrayList[Command[_]]()
  cmds.add(kCmds.newInsert(profile))
  cmds.add(kCmds.newFireAllRules("outFired"))

  // fire kie rules
  val results = kSession.execute(kCmds.newBatchExecution(cmds))
  val fired = results.getValue("outFired").toString.toInt

  // collect the inserted recommendation objects and create uid string
  import scala.collection.JavaConversions._
  var gresults = kSession.getObjects
  gresults = gresults.drop(1) // drop the inserted profile object which also gets collected

  val recommendations = scala.collection.mutable.ListBuffer[Recommendation]()
  gresults.toList.foreach(reco => {
    val recommendation = reco.asInstanceOf[Recommendation]
    recommendations += recommendation
  })
  kSession.dispose
  val uIds = StringBuilder.newBuilder
  if(recommendations.size > 0) {
    recommendations.foreach(recommendation => {
      uIds.append(recommendation.getOfferId + "_" + recommendation.getScore)
      uIds.append(";")
    })
    uIds.deleteCharAt(uIds.size - 1)
  }

  new ORecommendation(profile.getAttributes().get("cId").toString.toLong, fired, uIds.toString)
})
println("======================Output#1======================")
kieOuts.show(1000, false)
println("======================Output#2======================")
kieOuts.collect.foreach(println)

//separating cid and and each uid into individual rows
val kieOutsDs = kieOuts.as[(Long, Int, String)]
println("======================Output#3======================")
kieOutsDs.show(1000, false)

(I have sanitized/shortened the id's below, they are much bigger but with a similar format)

What I am seeing as outputs

Output#1 has a set of uIds(as String) come up

+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cId |rulesFired |    eligibleUIds   |
|842 |         17|123-25_2.0;12345678-48_9.0;28a-ad_5.0;123-56_10.0;123-27_2.0;123-32_3.0;c6d-e5_5.0;123-26_2.0;123-51_10.0;8e8-c1_5.0;123-24_2.0;df8-ad_5.0;123-36_5.0;123-16_2.0;123-34_3.0|
+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Output#2 has mostly a similar set of uIds show up(usually off by 1 element)

ORecommendation(842,17,123-36_5.0;123-24_2.0;8e8-c1_5.0;df8-ad_5.0;28a-ad_5.0;660-73_5.0;123-34_3.0;123-48_9.0;123-16_2.0;123-51_10.0;123-26_2.0;c6d-e5_5.0;123-25_2.0;123-56_10.0;123-32_3.0)

Output#3 is same as #Output1

+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|842 |         17 |123-32_3.0;8e8-c1_5.0;123-51_10.0;123-48_9.0;28a-ad_5.0;c6d-e5_5.0;123-27_2.0;123-16_2.0;123-24_2.0;123-56_10.0;123-34_3.0;123-36_5.0;123-6_2.0;123-25_2.0;660-73_5.0|
  • Every time I run it the difference between Output#1 and Output#2 is 1 element but never the same element (In the above example, Output#1 has 123-27_2.0 but Output#2 has 660-73_5.0)

  • Should they not be the same? I am still new to Scala/Spark and feel like I am missing something very fundamental


Solution

  • I think I figured this out, adding cache to kieOuts atleast got me identical outputs between show and collect. I will be looking at why KIE gives me different output for every run of the same input but that is a different issue