Say I have the following dataframe:
col1 | col2 | col3
a | toto | 1
a | toto | 2
a | toto | 45
a | toto | 789
a | toto | 456
b | titi | 4
b | titi | 8
With col1 as a primary key.
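For reference, the sample dataframe can be rebuilt like this (a sketch, assuming an existing SparkSession named spark):
# Sketch: reproduce the sample data; `spark` is assumed to be a SparkSession.
data = [
    ("a", "toto", 1), ("a", "toto", 2), ("a", "toto", 45),
    ("a", "toto", 789), ("a", "toto", 456),
    ("b", "titi", 4), ("b", "titi", 8),
]
df = spark.createDataFrame(data, ["col1", "col2", "col3"])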
I would like to know how I can determine which key in col1 has fewer than 5 occurrences in the dataframe.
So the output should be:
col1 | col2 | col3
b | titi |
So far, I came up with the following solution:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lit, sum  # Spark's sum, shadows the Python builtin here

# Window over all rows sharing the same col1 value.
anc_ref_window = Window.partitionBy("col1")

df \
    .withColumn("temp_one", lit(1)) \
    .withColumn("count", sum(col("temp_one")).over(anc_ref_window)) \
    .drop("temp_one") \
    .filter(col("count") < 5) \
    .drop("count") \
    .show()
Which gives the following results:
col1 | col2 | col3
b | titi | 4
b | titi | 8
1 - Is this the correct approach to the problem?
2 - How can I get only the expected output? With my version of PySpark (2.1.0), it seems there is no mechanism like select distinct col1, col2 as I would do in Impala (for example).
Edit:
The value of col3 in the output does not matter to me.
@koilaro pointed me towards distinct. However, it does not allow specifying column names in PySpark 2.1.0.
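The closest workaround I can see would be to select the key columns first and then call distinct(), but then col3 disappears from the result entirely. A rough sketch, assuming the dataframe after the count < 5 filter is kept in a variable named filtered (hypothetical name):
# Sketch: emulate "select distinct col1, col2" by projecting first, then deduplicating.
# `filtered` is assumed to hold the dataframe after the count < 5 filter above.
filtered.select("col1", "col2").distinct().show()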
dropDuplicates, on the other hand, does the job:
df \
    .withColumn("temp_one", lit(1)) \
    .withColumn("count", sum(col("temp_one")).over(anc_ref_window)) \
    .drop("temp_one") \
    .filter(col("count") < 5) \
    .drop("count") \
    .dropDuplicates(["col1"])  # keep only one row per col1 value
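For what it's worth, the temporary temp_one column can probably be avoided by counting the partition rows directly over the window. A minimal sketch of the same idea (count as a window function should work on 2.1.0, though I have not verified it on that exact version):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

anc_ref_window = Window.partitionBy("col1")

# Count the rows of each col1 partition directly, so no helper column is needed.
df \
    .withColumn("count", F.count(F.lit(1)).over(anc_ref_window)) \
    .filter(F.col("count") < 5) \
    .drop("count") \
    .dropDuplicates(["col1"]) \
    .show()
On the sample data this should print a single row for b, with col3 holding whichever of 4 or 8 Spark happens to keep.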