
How do I find distinct elements based on a particular field in a JavaRDD<ObjectHandler>?


My JavaRDD looks like this:

[
ObjectHandler [username=KAJAL, properties={}, event_name=INSTALL, pname=null, ptype=null, pvalue=null, date=2016-08-02T06:48:10.108Z],
ObjectHandler [username=KAJAL, properties={}, event_name=INSTALL, pname=null, ptype=null, pvalue=null, date=2016-08-02T06:51:12.089Z], 
ObjectHandler [username=KAJAL, properties={}, event_name=INSTALL, pname=null, ptype=null, pvalue=null, date=2016-08-02T06:52:44.285Z],
ObjectHandler [username=KAJAL, properties={}, event_name=INSTALL, pname=null, ptype=null, pvalue=null, date=2016-08-02T06:54:23.250Z],
ObjectHandler [username=KAJAL, properties={}, event_name=INSTALL, pname=null, ptype=null, pvalue=null, date=2016-08-02T06:55:35.045Z],
ObjectHandler [username=Hello, properties={}, event_name=INSTALL, pname=null, ptype=null, pvalue=null, date=2016-08-02T10:40:07.929Z], 
ObjectHandler [username=Hello, properties={}, event_name=INSTALL, pname=null, ptype=null, pvalue=null, date=2016-08-02T10:40:54.602Z],
ObjectHandler [username=neelam, properties={}, event_name=INSTALL, pname=null, ptype=null, pvalue=null, date=2016-08-03T07:16:23.085Z]
]

Now I want the elements that are distinct by username, like this:

[
ObjectHandler [username=KAJAL, properties={}, event_name=INSTALL, pname=null, ptype=null, pvalue=null, date=2016-08-02T06:48:10.108Z],
ObjectHandler [username=Hello, properties={}, event_name=INSTALL, pname=null, ptype=null, pvalue=null, date=2016-08-02T10:40:54.602Z],
ObjectHandler [username=neelam, properties={}, event_name=INSTALL, pname=null, ptype=null, pvalue=null, date=2016-08-03T07:16:23.085Z]
] 

I tried JavaRDD.distinct(), but the result was the same as the input. Please help.


Solution

  • JavaRDD.distinct() relies on ObjectHandler.equals and ObjectHandler.hashCode to decide whether two elements are the same. If you haven't overridden them, every ObjectHandler instance is considered different, so you end up with the same RDD.

    You therefore have two options:

    1. Override equals and hashCode so that they compare only username: this is simple to do, and distinct would then return the expected result. The downside is that other parts of your program might need two ObjectHandler instances with the same username to be treated as different, i.e. they might require stricter equality. If that's the case, use the second approach:

    2. Reduce by username: extract the username into the RDD's "key", reduce by that key while arbitrarily keeping one of the matching values (which one survives is not guaranteed), and then drop the keys. With Java 8 this would look like:

      final JavaRDD<ObjectHandler> result = rdd
          .keyBy(v -> v.username)
          .reduceByKey((ObjectHandler v1, ObjectHandler v2) -> v1)
          .values();
      

      With Java 7 this is more verbose, but the logic is identical:

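      // Required imports:
      // import org.apache.spark.api.java.JavaRDD;
      // import org.apache.spark.api.java.function.Function;
      // import org.apache.spark.api.java.function.Function2;

      final JavaRDD<ObjectHandler> result = rdd.keyBy(new Function<ObjectHandler, String>() {
          @Override
          public String call(ObjectHandler v1) throws Exception {
              return v1.username; // the username becomes the key
          }
      }).reduceByKey(new Function2<ObjectHandler, ObjectHandler, ObjectHandler>() {
          @Override
          public ObjectHandler call(ObjectHandler v1, ObjectHandler v2) throws Exception {
              return v1; // keep one of the two values arbitrarily
          }
      }).values(); // drop the keys again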
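
The first option has no code above, so here is a minimal sketch of what it could look like. The ObjectHandler below is a hypothetical, trimmed-down stand-in with only username and date (the real class has more fields), and Stream.distinct() is used in place of JavaRDD.distinct() so that the example runs without Spark; both decide duplicates through equals and hashCode, which is the point being illustrated.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class DistinctByUsername {

    // Hypothetical stand-in for the ObjectHandler class from the question.
    // Serializable because Spark has to ship RDD elements between executors.
    static final class ObjectHandler implements Serializable {
        final String username;
        final String date;

        ObjectHandler(String username, String date) {
            this.username = username;
            this.date = date;
        }

        // Two handlers are equal when their usernames match; date is ignored.
        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof ObjectHandler)) return false;
            return Objects.equals(username, ((ObjectHandler) other).username);
        }

        // Must agree with equals: equal objects need equal hash codes.
        @Override
        public int hashCode() {
            return Objects.hashCode(username);
        }

        @Override
        public String toString() {
            return "ObjectHandler [username=" + username + ", date=" + date + "]";
        }
    }

    // Stream.distinct() is an assumption standing in for JavaRDD.distinct().
    static List<ObjectHandler> distinctByUsername(List<ObjectHandler> input) {
        return input.stream().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<ObjectHandler> input = Arrays.asList(
            new ObjectHandler("KAJAL", "2016-08-02T06:48:10.108Z"),
            new ObjectHandler("KAJAL", "2016-08-02T06:51:12.089Z"),
            new ObjectHandler("Hello", "2016-08-02T10:40:07.929Z"),
            new ObjectHandler("Hello", "2016-08-02T10:40:54.602Z"),
            new ObjectHandler("neelam", "2016-08-03T07:16:23.085Z"));

        // One element per username remains.
        distinctByUsername(input).forEach(System.out::println);
    }
}
```

With the same equals and hashCode on the real class, rdd.distinct() should return one element per username. Note that the in-memory stream keeps the first occurrence of each username, whereas on an RDD the surviving duplicate depends on how the data is partitioned.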