apache-spark, apache-spark-dataset

In Apache Spark (Java), how can I remove elements from a Dataset where some field does not match?


I have a dataset, say:

a=1, b=2
a=2, b=2
a=2, b=3

...and I want to drop records where a has the same value but b has a different value; in this case, that means dropping both records where a=2.

I suspect I need a groupBy on a followed by some kind of filter where the b values differ.


Solution

  • I did my solution in Scala, but the idea carries straight over to Java (a sketch of a Java version follows the output below):

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window
    
    val df = sc.parallelize(Seq(
      (1, 2),
      (2, 2),
      (2, 3)
    )).toDF("a", "b")
    
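    // Count the rows in each group of equal "a"; any row whose group
    // has more than one member is dropped by the filter below.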
    val window = Window.partitionBy("a")
    val newDF = (df
                 .withColumn("count", count(lit(1)).over(window))
                 .where(col("count") === lit(1))
                 .drop("count"))
    
    newDF.show
    

    Output:

    +---+---+
    |  a|  b|
    +---+---+
    |  1|  2|
    +---+---+