python · apache-spark · pyspark

How to find the max value in a column in pyspark dataframe


I have a pyspark dataframe

deviceId    timestamp
009eeb      2024-04-22
009eeb      2024-04-24
7c002v      2024-04-20
7c002v      null
4fd556      null
4fd556      null

I need to get the max timestamp per device in the final dataframe and drop duplicates. I tried the code below

from pyspark.sql import Window, functions as F

w = Window.partitionBy('deviceId')
df_max = (df.withColumn('max_col', F.max('timestamp').over(w))
            .where((F.col('timestamp') == F.col('max_col')) | (F.col('timestamp').isNull()))
            .dropDuplicates())

But this code doesn't give me what I need; I'm getting the output below instead.

deviceId    timestamp
009eeb      2024-04-24
7c002v      null
4fd556      null
4fd556      null

I need the result to be as below

deviceId    timestamp
009eeb      2024-04-24
7c002v      2024-04-20
4fd556      null

Please suggest what changes need to be made.


Solution

  • Just use the max function directly: in an aggregation, F.max ignores nulls, so grouping by deviceId keeps each device's latest non-null timestamp and returns null only when every timestamp for that device is null.

    from pyspark.sql import SparkSession, functions as F
    
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    
    data = [
        ('009eeb', '2024-04-22'),
        ('009eeb', '2024-04-24'),
        ('7c002v', '2024-04-20'),
        ('7c002v', None),
        ('4fd556', None),
        ('4fd556', None),
    ]
    df = spark.createDataFrame(data, ['deviceId', 'timestamp'])
    # F.max skips nulls, so the result is null only when all timestamps for a device are null
    df = df.groupBy('deviceId').agg(F.max('timestamp').alias('timestamp'))
    df.show(truncate=False)
    
    # +--------+----------+
    # |deviceId|timestamp |
    # +--------+----------+
    # |009eeb  |2024-04-24|
    # |4fd556  |null      |
    # |7c002v  |2024-04-20|
    # +--------+----------+
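
  • If you'd rather keep the window approach from your attempt, a minimal sketch (reusing spark, data and F from the snippet above; df2 here just rebuilds the sample dataframe) is to compare the timestamp with the per-device max using a null-safe equality instead of == plus isNull. Because F.max skips nulls, only devices whose timestamps are all null keep a null row, and dropDuplicates then collapses those repeated rows.

    from pyspark.sql import Window

    df2 = spark.createDataFrame(data, ['deviceId', 'timestamp'])
    w = Window.partitionBy('deviceId')
    df_win = (df2.withColumn('max_col', F.max('timestamp').over(w))
                 # eqNullSafe keeps rows equal to the per-device max,
                 # which is null only when every timestamp is null
                 .where(F.col('timestamp').eqNullSafe(F.col('max_col')))
                 .drop('max_col')
                 .dropDuplicates())
    df_win.show(truncate=False)

    This should produce the same three rows as the groupBy version above.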