Tags: apache-spark, pyspark, apache-spark-sql, row, unique

Count unique values in a row


The test data:

df = spark.createDataFrame([(1, 1), (2, 3), (3, 3)], ['c1', 'c2'])
df.show()
#+---+---+
#| c1| c2|
#+---+---+
#|  1|  1|
#|  2|  3|
#|  3|  3|
#+---+---+

I want to count the distinct values in every row and store the count in a separate column. How can I do this?

The desired result:

#+---+---+---+
#| c1| c2| c3|
#+---+---+---+
#|  1|  1|  1|
#|  2|  3|  2|
#|  3|  3|  1|
#+---+---+---+

Solution

  • Build an array from all columns, deduplicate it with array_distinct, and take its size:

    import pyspark.sql.functions as F
    
    df.withColumn('c3', F.size(F.array_distinct(F.array(*df.columns)))).show()
    #+---+---+---+
    #| c1| c2| c3|
    #+---+---+---+
    #|  1|  1|  1|
    #|  2|  3|  2|
    #|  3|  3|  1|
    #+---+---+---+
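
Note that array_distinct is only available from Spark 2.4 onwards. The per-row logic it implements is simply the length of the set of the row's values, so on older versions the same result can be obtained with a UDF. A minimal sketch (the distinct_count name and the UDF wiring are illustrative, not part of the original answer):

```python
# Row-level logic: number of distinct values among the columns.
def distinct_count(*cols):
    return len(set(cols))

# Wiring it into Spark < 2.4 would look like this (not executed here):
# from pyspark.sql import functions as F
# from pyspark.sql.types import IntegerType
# udf_dc = F.udf(distinct_count, IntegerType())
# df.withColumn('c3', udf_dc(*df.columns)).show()

# Checking the logic against the test data above:
print([distinct_count(*row) for row in [(1, 1), (2, 3), (3, 3)]])  # [1, 2, 1]
```

Be aware that a Python UDF forces serialization between the JVM and Python workers, so the built-in array_distinct approach should be preferred whenever it is available.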