Tags: scala, apache-spark, rdd

Find the latest / earliest day in Spark RDD


I have an RDD m2 consisting of

import java.sql.Date

case class Medication(patientID: String, date: Date, medicine: String)

and I need to find the first and the last day. I tried

val latest_date_m2  = m2.maxBy(_.date).date

I got:

No implicit Ordering defined for java.sql.Date.
[error]       val latest_date_m2 = m2.maxBy(_.date).date

It looks like Scala "does not know" how to compare the dates. I think I need to replace maxBy with a different function, but I cannot find one.


Solution

  • Just provide the Ordering:

    import scala.math.Ordering
    
    object SQLDateOrdering extends Ordering[java.sql.Date] {
      def compare(a: java.sql.Date, b: java.sql.Date) = a compareTo b
    }
    
    m2.maxBy(_.date)(SQLDateOrdering)
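
    Since the question also asks for the earliest day, the same ordering works with minBy (a sketch, assuming m2 is a Seq at this point):

    val latest_date_m2   = m2.maxBy(_.date)(SQLDateOrdering).date
    val earliest_date_m2 = m2.minBy(_.date)(SQLDateOrdering).date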
    

    though it is worth noting that m2 cannot be an RDD, as RDD has no maxBy method (it is likely a Seq). If it were an RDD you'd need

    object MedicationDateOrdering extends Ordering[Medication] {
      def compare(a: Medication, b: Medication) = a.date compareTo b.date
    }
    

    and use max:

    m2.max()(MedicationDateOrdering)
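
    For completeness, here is a minimal end-to-end sketch for the RDD case (the SparkContext setup, object names, and sample rows are hypothetical). RDD.min takes the same Ordering, so it gives you the earliest day as well:

    import java.sql.Date
    import org.apache.spark.{SparkConf, SparkContext}

    case class Medication(patientID: String, date: Date, medicine: String)

    // Ordering is Serializable, so it can be shipped to executors
    object MedicationDateOrdering extends Ordering[Medication] {
      def compare(a: Medication, b: Medication) = a.date compareTo b.date
    }

    object FindDateRange {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("med-dates").setMaster("local[*]"))

        // Hypothetical sample data
        val m2 = sc.parallelize(Seq(
          Medication("p1", Date.valueOf("2015-01-01"), "aspirin"),
          Medication("p2", Date.valueOf("2016-06-15"), "ibuprofen")))

        val latest   = m2.max()(MedicationDateOrdering).date  // 2016-06-15
        val earliest = m2.min()(MedicationDateOrdering).date  // 2015-01-01
        println(s"earliest=$earliest, latest=$latest")

        sc.stop()
      }
    }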