Tags: scala, apache-spark, apache-spark-sql, apache-zeppelin

Reduce the size of a Spark DataFrame by selecting only every n-th element with Scala


I have an org.apache.spark.sql.DataFrame = [t: double, S: long]

[image: the original DataFrame's contents]

Now I want to reduce the DataFrame by keeping only every 2nd element, i.e. with val n = 2.

The result should be

[image: the expected result]

How would you solve this problem?

I tried inserting a third column and using modulo, but I couldn't solve it.


Solution

  • If I understand your question correctly, you want to keep every nth row of your DataFrame and remove all others. Assuming t is not your row index, add an index column and then filter on it:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    
    val n = 2
    val filteredDF = df
      .withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())))
      .filter($"index" % n === 0)  // keeps rows 2, 4, 6, ...
      .drop("index")
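The Window/row_number approach above is Spark-specific, but the underlying logic is simply filtering on a 1-based position. A minimal plain-Scala sketch of the same idea on a local collection (the sample data here is hypothetical), which can help verify the modulo condition before running it on a cluster:

```scala
// Keep every n-th element of a sequence, using 1-based indices
// to mirror Spark's row_number() (which also starts at 1).
val data = Vector(0.0, 0.1, 0.2, 0.3, 0.4, 0.5)
val n = 2
val everyNth = data.zipWithIndex
  .filter { case (_, i) => (i + 1) % n == 0 }  // 1-based index divisible by n
  .map { case (value, _) => value }
// everyNth == Vector(0.1, 0.3, 0.5)
```

Note that `% n === 0` keeps the 2nd, 4th, 6th, ... rows; if you instead want to keep the 1st, 3rd, 5th, ... rows, filter with `$"index" % n === 1`.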