Search code examples
dataframeapache-sparkpysparkapache-spark-sql

Is there a way in pyspark to count unique values


I have a spark dataframe (12m x 132) and I am trying to calculate the number of unique values by column, and remove columns that have only 1 unique value.

So far, I have used the pandas nunique function as such:

import pandas as pd

df = sql_dw.read_table(<table>)
df_p = df.toPandas()

nun = df_p.nunique(axis=0)
nundf = pd.DataFrame({'atr':nun.index, 'countU':nun.values})

dropped = []
for i, j in nundf.values:
  if j == 1:
    dropped.append(i)
    df = df.drop(i)
print(dropped)

Is there a way to do this that is more native to spark - i.e. not using pandas?


Solution

  • Please have a look at the commented example below. The solution requires more python as pyspark specific knowledge.

    import pyspark.sql.functions as F
    #creating a dataframe
    columns = ['asin' ,'ctx' ,'fo' ]
    
    l = [('ASIN1','CTX1','FO1')
    ,('ASIN1','CTX1','FO1')
    ,('ASIN1','CTX1','FO2')
    ,('ASIN1','CTX2','FO1')
    ,('ASIN1','CTX2','FO2')
    ,('ASIN1','CTX2','FO2')
    ,('ASIN1','CTX2','FO3')
    ,('ASIN1','CTX3','FO1')
    ,('ASIN1','CTX3','FO3')]
    
    df=spark.createDataFrame(l, columns)
    
    df.show()
    #we create a list of functions we want to apply
    #in this case countDistinct for each column
    expr = [F.countDistinct(c).alias(c) for c in df.columns]
    
    #we apply those functions
    countdf =  df.select(*expr)
    #this df has just one row
    countdf.show()
    
    #we extract the columns which have just one value
    cols2drop = [k for k,v in countdf.collect()[0].asDict().items() if v == 1]
    df.drop(*cols2drop).show()
    

    Output:

    +-----+----+---+
    | asin| ctx| fo|
    +-----+----+---+
    |ASIN1|CTX1|FO1|
    |ASIN1|CTX1|FO1|
    |ASIN1|CTX1|FO2|
    |ASIN1|CTX2|FO1|
    |ASIN1|CTX2|FO2|
    |ASIN1|CTX2|FO2|
    |ASIN1|CTX2|FO3|
    |ASIN1|CTX3|FO1|
    |ASIN1|CTX3|FO3|
    +-----+----+---+
    
    +----+---+---+
    |asin|ctx| fo|
    +----+---+---+
    |   1|  3|  3|
    +----+---+---+
    
    +----+---+
    | ctx| fo|
    +----+---+
    |CTX1|FO1|
    |CTX1|FO1|
    |CTX1|FO2|
    |CTX2|FO1|
    |CTX2|FO2|
    |CTX2|FO2|
    |CTX2|FO3|
    |CTX3|FO1|
    |CTX3|FO3|
    +----+---+