scala · apache-spark · apache-spark-sql

How to keep one specific row while dropping all other unnecessary rows in scala spark dataframe?


Just as the title, in scala spark, if I have a dataframe like below:

+-------------+----------+
| key         | Time     |
+-------------+----------+
|  1          |        1 |  
|  1          |        2 | 
|  1          |        4 |
|  2          |        2 | 
|  2          |        3 | 
+-------------+----------+

For rows with the same key, how can I keep only the row with the least time and drop all the other unnecessary rows? In this case, key 1 has 3 rows with different times; the least time is 1, so I only want to keep the (key 1, time 1) row and drop the other 2 rows for key 1. Likewise for key 2, I only want to keep (key 2, time 2), so I drop the (key 2, time 3) row. The key column is LongType and the time column is StringType. Is there some way to achieve this?

+-------------+----------+
| key         | Time     |
+-------------+----------+
|  1          |        1 |  
|  2          |        2 | 
+-------------+----------+

I tried to use the drop and filter functions, but I don't think they work here.


Solution

  • Try something similar to this (with my own data, of course):

    %scala
    
    import spark.implicits._
    import org.apache.spark.sql.functions._
    
    val df = sc.parallelize(Seq( (1,7), (1,8), (1,9), (2,2), (2,99) )).toDF("i", "c")
    df.groupBy($"i").agg(min($"c")).show()
    

    returns:

    +---+------+
    |  i|min(c)|
    +---+------+
    |  1|     7|
    |  2|     2|
    +---+------+
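Note that `groupBy`/`min` returns only the key and the aggregated value, which happens to be enough here because the DataFrame has just two columns. If you need to keep the entire original row (or more columns), a common alternative is a window function: rank the rows within each key by time and keep rank 1. Below is a sketch using the question's column names and types; the `SparkSession` builder and the app name `min-per-key` are just assumptions for a standalone run — in `spark-shell` or a notebook, `spark` already exists and you can drop the builder lines.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Only needed in a standalone program; spark-shell/notebooks provide `spark`.
val spark = SparkSession.builder().master("local[1]").appName("min-per-key").getOrCreate()
import spark.implicits._

// Same data as the question: key is LongType, Time is StringType.
val df = Seq((1L, "1"), (1L, "2"), (1L, "4"), (2L, "2"), (2L, "3")).toDF("key", "Time")

// Time is a string, so cast before ordering ("10" < "2" lexicographically).
val w = Window.partitionBy($"key").orderBy($"Time".cast("long"))

val result = df
  .withColumn("rn", row_number().over(w)) // rank rows within each key by Time
  .filter($"rn" === 1)                    // keep only the earliest row per key
  .drop("rn")

result.show()
```

If you prefer to stay with the `groupBy`/`min` approach from the answer, you can recover the full rows by joining the aggregated result back to the original DataFrame on the key and the min value.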