
Get distinct rows by creation date


I am working with a dataframe like this:

DeviceNumber        | CreationDate       | Name
1001                | 1.1.2018           | Testdevice
1001                | 30.06.2019         | Device
1002                | 1.1.2019           | Lamp

I am using Databricks and PySpark for the ETL process. How can I reduce the dataframe so that it contains only a single row per "DeviceNumber", namely the row with the highest "CreationDate"? In this example, I want the result to look like this:

DeviceNumber        | CreationDate       | Name
1001                | 30.06.2019         | Device
1002                | 1.1.2019           | Lamp
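
For reference, the example dataframe can be recreated with the following minimal sketch (the d.M.yyyy date format is an assumption inferred from the sample values):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as psf

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1001, "1.1.2018", "Testdevice"),
         (1001, "30.06.2019", "Device"),
         (1002, "1.1.2019", "Lamp")],
        ["DeviceNumber", "CreationDate", "Name"],
    )

    # parse the strings to DateType so that max() compares dates
    # chronologically rather than lexicographically
    df = df.withColumn("CreationDate", psf.to_date("CreationDate", "d.M.yyyy"))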

Solution

  • You can create an additional dataframe with each DeviceNumber and its latest (max) CreationDate.

    import pyspark.sql.functions as psf

    # latest CreationDate per DeviceNumber
    max_df = df\
        .groupBy('DeviceNumber')\
        .agg(psf.max('CreationDate').alias('max_CreationDate'))
    

    Then join max_df with the original dataframe.

    joining_condition = [
        df.DeviceNumber == max_df.DeviceNumber,
        df.CreationDate == max_df.max_CreationDate,
    ]

    # keep only the rows of df that have a match in max_df
    df.join(max_df, joining_condition, 'left_semi').show()
    

    A left_semi join is useful when you want to use the second dataframe purely as a lookup and don't need any of its columns in the result.
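
    For comparison, here is a minimal sketch of the same lookup done with an inner join (using df and max_df from above); the columns of max_df come along and have to be dropped manually:

    # an inner join pulls in max_df's columns, which then need to be dropped
    df.join(max_df, joining_condition, 'inner')\
        .drop(max_df.DeviceNumber)\
        .drop('max_CreationDate')\
        .show()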