apache-spark apache-spark-1.3

In Spark, how do I read a field by its name instead of by its index?


I use Spark 1.3.

My data has more than 50 attributes, so I went with a custom class.

How do I access a field of a custom class by its name rather than by its position?

Currently I have to call productElement(0) every time.

Also, I am not supposed to use a case class, so I am using a custom class for the schema.

    class OnlineEvents(gsm_id: String,
                       attribution_id: String,
                       event_date: String,
                       event_timestamp: String,
                       event_type: String) extends Product {

      override def productElement(n: Int): Any = n match {
        case 0 => gsm_id
        case 1 => attribution_id
        case 2 => event_date
        case 3 => event_timestamp
        case 4 => event_type
        case _ => throw new IndexOutOfBoundsException(n.toString)
      }

      override def productArity: Int = 5

      override def canEqual(that: Any): Boolean = that.isInstanceOf[OnlineEvents]

    }

My Spark code:

    val onlineRDD = sc.textFile("/user/cloudera/input_files/online_events.txt")

    val schemaRDD = onlineRDD.map(record => {
      val arr: Array[String] = record.split(",")
      new OnlineEvents(arr(0), arr(1), arr(2), arr(3), arr(4))
    })

    val keyvalueRDD = schemaRDD.map(online =>
      ((online.productElement(0).toString, online.productElement(4).toString), online))

If I want to access any field of OnlineEvents, I have to use productElement() (e.g. online.productElement(0) for gsm_id).

Can I access the fields directly, as online.gsm_id ... online.event_type, so that my code is more readable?

How do I access a field directly by its name when I use a custom class for the schema?


Solution

  • I strongly recommend using one case class per use case, so that together the case classes cover every use case that touches the data.

    Each use case then gets its own small case class, which saves you a lot of thinking about how to maintain the 50+ fields.

    Yes, you'd "trade" a single big class with 50 or more fields for ten 5-field case classes, but given how easy it is to create a case class and how nicely they describe your data, I think it's worth the hassle.
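As a rough illustration of the recommendation above, here is a minimal sketch that uses the five fields from the question. It assumes a small case class for one use case (keying events by gsm_id and event_type). The names FieldAccessSketch, EventKey, parse and keyOf are hypothetical and only for illustration. The sketch also shows an option the answer does not mention: if a case class really is off limits, declaring the constructor parameters of the custom class as `val` makes them public members, so they can be read by name. Mixing in java.io.Serializable is an assumption too, added because Spark may need to serialize the records.

```scala
// Sketch only: every name except OnlineEvents and its fields is hypothetical.
object FieldAccessSketch {

  // Plain (non-case) class. `val` turns each constructor parameter into a
  // public member, so online.gsm_id works alongside productElement(0).
  class OnlineEvents(val gsm_id: String,
                     val attribution_id: String,
                     val event_date: String,
                     val event_timestamp: String,
                     val event_type: String)
      extends Product with java.io.Serializable { // Serializable is an assumption

    override def productElement(n: Int): Any = n match {
      case 0 => gsm_id
      case 1 => attribution_id
      case 2 => event_date
      case 3 => event_timestamp
      case 4 => event_type
      case _ => throw new IndexOutOfBoundsException(n.toString)
    }

    override def productArity: Int = 5

    override def canEqual(that: Any): Boolean = that.isInstanceOf[OnlineEvents]
  }

  // One small case class per use case: here, the key used for grouping events.
  case class EventKey(gsm_id: String, event_type: String)

  // Parse one comma-separated record into the full event (no validation).
  def parse(record: String): OnlineEvents = {
    val arr = record.split(",")
    new OnlineEvents(arr(0), arr(1), arr(2), arr(3), arr(4))
  }

  // Project the full event onto the fields this use case needs, by name.
  def keyOf(e: OnlineEvents): EventKey = EventKey(e.gsm_id, e.event_type)

  def main(args: Array[String]): Unit = {
    val online = parse("g1,a1,2015-06-01,10:00:00,click")
    println(keyOf(online))
  }
}
```

With the question's RDD, the keying step could then be written as schemaRDD.map(online => (FieldAccessSketch.keyOf(online), online)), or simply with online.gsm_id and online.event_type in place of the productElement calls.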