Tags: dataframe, scala, apache-spark

How to use the agg function without dropping other columns in a dataframe?


Suppose I have a dataframe like the one below:

+-------------+----------+-----+------+
| key         | Time     |value|value2|
+-------------+----------+-----+------+
|  1          |        1 |  1  |   1  |
|  1          |        2 |  2  |   2  |
|  1          |        4 |  3  |   3  |
|  2          |        2 |  4  |   4  |
|  2          |        3 |  5  |   5  | 
+-------------+----------+-----+------+
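
For reference, this example dataframe can be built as follows (a minimal sketch, assuming a SparkSession named spark is in scope and all columns are integers):

    import spark.implicits._

    val df = Seq(
      (1, 1, 1, 1),
      (1, 2, 2, 2),
      (1, 4, 3, 3),
      (2, 2, 4, 4),
      (2, 3, 5, 5)
    ).toDF("key", "Time", "value", "value2")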

For each key, I want to keep the row with the least Time, along with its values. For this case, this is the dataframe I want:

+----------+----------+-----+------+
| Time     | key      |value|value2|
+----------+----------+-----+------+
|  1       |        1 |  1  |   1  |
|  2       |        2 |  4  |   4  |
+----------+----------+-----+------+

I tried using groupBy and agg, but these operations drop the value columns (as shown in the sketch below). Is there a way to keep the value and value2 columns?
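
For example, a plain aggregation along these lines keeps only the grouped and aggregated columns:

    import org.apache.spark.sql.functions.min

    // only key and the aggregated Time survive; value and value2 are dropped
    df.groupBy("key")
      .agg(min("Time") as "Time")
      .show()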


Solution

  • You can use struct to create a tuple containing the Time column and everything else you want to keep (the s column in the code below). Since min compares struct fields from left to right, putting Time first means the aggregation keeps, for each key, the struct built from the row with the least Time. You can then use s.* to unfold the struct and get your columns back.

    import org.apache.spark.sql.functions.{min, struct}
    import spark.implicits._ // enables the 'col symbol syntax used below

    val result = df
        .withColumn("s", struct('Time, 'value, 'value2)) // Time first, so min orders by it
        .groupBy("key")
        .agg(min('s) as "s")
        .select("key", "s.*")
    
    // to order the columns the way you want:
    val new_result = result.select("Time", "key", "value", "value2")
    // or, without listing the columns by hand (note the filter must match the exact case, "Time"):
    val new_result2 = result.select("Time", df.columns.filter(_ != "Time"): _*)