I'm working in Azure Databricks with PySpark. There is an existing dataframe like the one below:
df:
id | Name | Age | Country |
---|---|---|---|
1 | Mike | 18 | USA |
2 | Lily | 25 | Japan |
2 | Lily | 22 | Japan |
3 | John | 23 | China |
4 | Snow | 36 | Korea |
5 | Tom | 28 | Austria |
5 | Cindy | 28 | Italy |
The column "id" is the primary key and cannot contain duplicate values, meaning each number should appear only once. But as you can see, ids 2 and 5 each have two rows with different values. I need to keep only one row for id 2 and one row for id 5. Dropping one at random and keeping the other is fine, or we can keep only the upper row when there are duplicates.
I need to get the following:
id | Name | Age | Country |
---|---|---|---|
1 | Mike | 18 | USA |
2 | Lily | 25 | Japan |
3 | John | 23 | China |
4 | Snow | 36 | Korea |
5 | Tom | 28 | Austria |
or below is also fine:
id | Name | Age | Country |
---|---|---|---|
1 | Mike | 18 | USA |
2 | Lily | 22 | Japan |
3 | John | 23 | China |
4 | Snow | 36 | Korea |
5 | Cindy | 28 | Italy |
How should I write the script?
Thanks.
You could simply use dropDuplicates() with id as the subset column. See the example below:
# given the following input
# +---+---+---+---+
# | id| c1| c2| c3|
# +---+---+---+---+
# | 1|foo|bar|baz|
# | 1|foo|baz|bar|
# | 2|foo|bar|baz|
# | 2|foo|bar|baz|
# +---+---+---+---+
data_sdf. \
dropDuplicates(subset=['id']). \
show()
# +---+---+---+---+
# | id| c1| c2| c3|
# +---+---+---+---+
# | 1|foo|bar|baz|
# | 2|foo|bar|baz|
# +---+---+---+---+