apache-spark, pyspark, apache-spark-sql, apache-zeppelin

What's the best way to show distinct values for a dataframe in pyspark?


I'd like to check the distinct values for a data frame, and I know there are a few ways that I can do it. I'd like to look at the unique values for the columns rabbit, platypus and book.

This is the first way:

(mydf
    .select("rabbit", "platypus", "book")
    .distinct()
    .show())

This is the second way:

(mydf
    .select("rabbit", "platypus", "book")
    .distinct()
    .count())

This is another way, grouping on each column and collecting the per-group counts:

 rabbit = mydf.groupBy("rabbit").count().collect()

 platypus = mydf.groupBy("platypus").count().collect()

 book = mydf.groupBy("book").count().collect()

Solution

  • .collect() brings all the results back to the driver and can cause OOM errors on big datasets.

    Use the .distinct() method instead, and if you want the count of distinct records, use df.distinct().count().