
Get distinct rows by creation date


I am working with a dataframe like this:

DeviceNumber        | CreationDate       | Name
1001                | 1.1.2018           | Testdevice
1001                | 30.06.2019         | Device
1002                | 1.1.2019           | Lamp

I am using Databricks and PySpark for the ETL process. How can I reduce the dataframe so that it contains only a single row per "DeviceNumber", namely the row with the highest "CreationDate"? In this example, I want the result to look like this:

DeviceNumber        | CreationDate       | Name
1001                | 30.06.2019         | Device
1002                | 1.1.2019           | Lamp
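
For reference, the example dataframe can be recreated with the following minimal sketch (the d.M.yyyy date format is an assumption inferred from the sample values):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as psf

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1001, "1.1.2018", "Testdevice"),
         (1001, "30.06.2019", "Device"),
         (1002, "1.1.2019", "Lamp")],
        ["DeviceNumber", "CreationDate", "Name"],
    )

    # parse the strings to DateType so that max() compares dates
    # chronologically rather than lexicographically
    df = df.withColumn("CreationDate", psf.to_date("CreationDate", "d.M.yyyy"))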

Solution

  • You can create an additional dataframe with each DeviceNumber and its latest (max) CreationDate.

    import pyspark.sql.functions as psf

    # latest CreationDate per DeviceNumber
    max_df = df\
        .groupBy('DeviceNumber')\
        .agg(psf.max('CreationDate').alias('max_CreationDate'))
    

    Then join max_df with the original dataframe.

    joining_condition = [
        df.DeviceNumber == max_df.DeviceNumber,
        df.CreationDate == max_df.max_CreationDate,
    ]

    # keep only the rows of df that have a match in max_df
    df.join(max_df, joining_condition, 'left_semi').show()
    

    A left_semi join is useful when you want to use the second dataframe purely as a lookup and don't need any of its columns in the result.
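
    For comparison, here is a minimal sketch of the same lookup done with an inner join (using df and max_df from above); the columns of max_df come along and have to be dropped manually:

    # an inner join pulls in max_df's columns, which then need to be dropped
    df.join(max_df, joining_condition, 'inner')\
        .drop(max_df.DeviceNumber)\
        .drop('max_CreationDate')\
        .show()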