I have a PySpark dataframe:
deviceId timestamp
009eeb 2024-04-22
009eeb 2024-04-24
7c002v 2024-04-20
7c002v null
4fd556 null
4fd556 null
I need to keep only the max timestamp per deviceId in the final dataframe and drop duplicates. I tried the code below:
w = Window.partitionBy('deviceId')
df_max = df.withColumn('max_col', F.max('timestamp').over(w))\
    .where((F.col('timestamp') == F.col('max_col')) | (F.col('timestamp').isNull()))\
    .dropDuplicates()
But from this code, I'm getting the output below instead:
deviceId timestamp
009eeb 2024-04-24
7c002v null
4fd556 null
4fd556 null
I need to get the result as below
deviceId timestamp
009eeb 2024-04-24
7c002v 2024-04-20
4fd556 null
Please suggest what changes need to be made.
Just use the max function directly with a groupBy. The max aggregate ignores nulls, so each deviceId gets its latest non-null timestamp and stays null only when every value for that device is null; grouping also removes the duplicate rows.
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
data = [
('009eeb', '2024-04-22'),
('009eeb', '2024-04-24'),
('7c002v', '2024-04-20'),
('7c002v', None),
('4fd556', None),
('4fd556', None),
]
df = spark.createDataFrame(data, ['deviceId', 'timestamp'])
# Aggregate per device: max() skips nulls, so only all-null devices keep null
df = df.groupBy('deviceId').agg(F.max('timestamp').alias('timestamp'))
df.show(truncate=False)
# +--------+----------+
# |deviceId|timestamp |
# +--------+----------+
# |009eeb |2024-04-24|
# |4fd556 |null |
# |7c002v |2024-04-20|
# +--------+----------+
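If you would rather stay close to your original window approach, here is a minimal sketch using row_number: order each device's rows with nulls last and keep only the first row. This assumes the timestamps are ISO-formatted strings (as in the sample data), so they sort correctly.

from pyspark.sql import Window, functions as F

# Latest non-null timestamp first; nulls sort last, so a device keeps
# null only when it has no non-null timestamp at all.
w = Window.partitionBy('deviceId').orderBy(F.col('timestamp').desc_nulls_last())

df_max = (df.withColumn('rn', F.row_number().over(w))
            .where(F.col('rn') == 1)
            .drop('rn'))
df_max.show(truncate=False)

Because row_number assigns exactly one rank-1 row per device, the dropDuplicates call is no longer needed.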