Tags: python, database, dataframe, pyspark, azure-databricks

Pyspark dataframe: How to randomly drop one row if there are two duplicated rows for the same primary key? Databricks


I'm working in Azure Databricks with PySpark. There is an existing dataframe like the one below.

df:

id  Name   Age  Country
1   Mike   18   USA
2   Lily   25   Japan
2   Lily   22   Japan
3   John   23   China
4   Snow   36   Korea
5   Tom    28   Austria
5   Cindy  28   Italy

column "id" is the primary key and cannot have duplicated values, which means all the numbers should appear only once, but now as you could see, number 2 and 5 have two rows with different values. I need to keep only one row for id 2 and one row for id 5 only. We can drop randomly and keep the other row, or we can keep only the upper row if there are duplicated rows

I need to have the following:

id  Name   Age  Country
1   Mike   18   USA
2   Lily   25   Japan
3   John   23   China
4   Snow   36   Korea
5   Tom    28   Austria

or the following is also fine:

id  Name   Age  Country
1   Mike   18   USA
2   Lily   22   Japan
3   John   23   China
4   Snow   36   Korea
5   Cindy  28   Italy

How should I write the script?

Thanks.


Solution

  • You could simply use dropDuplicates() with id as the subset column.

    See the example below:

    # given the following input
    # +---+---+---+---+
    # | id| c1| c2| c3|
    # +---+---+---+---+
    # |  1|foo|bar|baz|
    # |  1|foo|baz|bar|
    # |  2|foo|bar|baz|
    # |  2|foo|bar|baz|
    # +---+---+---+---+
    
    data_sdf. \
        dropDuplicates(subset=['id']). \
        show()
    
    # +---+---+---+---+
    # | id| c1| c2| c3|
    # +---+---+---+---+
    # |  1|foo|bar|baz|
    # |  2|foo|bar|baz|
    # +---+---+---+---+
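
    To apply the same idea to the dataframe from the question, a minimal sketch could look like the following (the dataframe construction and variable names here are assumed for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # sample data matching the question
    df = spark.createDataFrame(
        [
            (1, 'Mike', 18, 'USA'),
            (2, 'Lily', 25, 'Japan'),
            (2, 'Lily', 22, 'Japan'),
            (3, 'John', 23, 'China'),
            (4, 'Snow', 36, 'Korea'),
            (5, 'Tom', 28, 'Austria'),
            (5, 'Cindy', 28, 'Italy'),
        ],
        ['id', 'Name', 'Age', 'Country']
    )

    # keep one row per id; which duplicate survives is not guaranteed
    df.dropDuplicates(subset=['id']).show()

    Note that dropDuplicates() does not let you choose which duplicate is kept. If you need deterministic control (for example, always keeping the row with the highest Age), one common alternative is a window with row_number(); the ordering column used here is an assumption:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # rank rows within each id and keep only the first row per group
    w = Window.partitionBy('id').orderBy(F.col('Age').desc())

    df. \
        withColumn('rn', F.row_number().over(w)). \
        filter(F.col('rn') == 1). \
        drop('rn'). \
        show()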