Tags: python, apache-spark, pyspark, apache-spark-2.0

How to count the number of occurrences of a key in a pyspark dataframe (2.1.0)


Context

Say I have the following dataframe:

col1 | col2 | col3
a    | toto | 1
a    | toto | 2
a    | toto | 45
a    | toto | 789
a    | toto | 456
b    | titi | 4
b    | titi | 8

With col1 as a primary key.

I would like to know how I can determine which keys in col1 have fewer than 5 occurrences in the dataframe.

So the output should be:

col1 | col2 | col3
b    | titi | 

So far, I have come up with the following solution:

from pyspark.sql import Window
from pyspark.sql.functions import col, lit, sum  # sum here is the Spark SQL aggregate, not the Python builtin

# count the rows per col1 partition, then keep only the keys seen fewer than 5 times
anc_ref_window = Window.partitionBy("col1")
df\
    .withColumn("temp_one", lit(1)) \
    .withColumn("count", sum(col("temp_one")).over(anc_ref_window)) \
    .drop("temp_one") \
    .filter(col("count") < 5) \
    .drop("count") \
    .show()

Which gives the following results:

col1 | col2 | col3
b    | titi | 4
b    | titi | 8
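
As a side note, the temp_one helper column is not strictly necessary: counting over the same window should give an equivalent result. A minimal sketch, assuming the same df and anc_ref_window as above:

from pyspark.sql.functions import col, count, lit

# count rows per col1 partition directly, without a helper column
df\
    .withColumn("count", count(lit(1)).over(anc_ref_window)) \
    .filter(col("count") < 5) \
    .drop("count") \
    .show()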

Questions

1 - Is this the correct approach to the problem?

2 - How can I get only the expected output? With my version of pyspark (2.1.0), there seems to be no mechanism such as select distinct col1, col2, as I would do via Impala (for example).
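
(For reference, something close to the Impala-style query can be expressed by registering the filtered DataFrame as a temporary view; a minimal sketch, assuming a SparkSession named spark and a hypothetical filtered_df holding the rows that passed the count filter:)

# filtered_df is assumed to hold the rows with count < 5 (hypothetical name)
filtered_df.createOrReplaceTempView("filtered")
spark.sql("SELECT DISTINCT col1, col2 FROM filtered").show()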

Edit:

The output value in col3 does not matter to me.


Solution

  • @koilaro pointed me towards distinct. However, it does not provide a way to specify column names in pyspark 2.1.0.

    dropDuplicates, on the other hand, does the job:

    # same imports and anc_ref_window as in the question
    df\
        .withColumn("temp_one", lit(1)) \
        .withColumn("count", sum(col("temp_one")).over(anc_ref_window)) \
        .drop("temp_one") \
        .filter(col("count") < 5) \
        .drop("count") \
        .dropDuplicates(["col1"])  # keep a single row per col1 value
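
    With the sample data above, a final .show() then leaves a single row for col1 = b; which col3 value (4 or 8) is kept is not guaranteed, which is fine here since col3 does not matter:

    col1 | col2 | col3
    b    | titi | 4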