python apache-spark pyspark apache-spark-sql

Show distinct column values in pyspark dataframe


With a pyspark dataframe, how do you do the equivalent of Pandas df['col'].unique()?

I want to list out all the unique values in a pyspark dataframe column.

Not the SQL way (registerTempTable, then a SQL query for distinct values).

Also, I don't need groupby followed by countDistinct; I want to see the distinct VALUES in that column.


Solution

  • Let's assume we're working with the following data (two columns, k and v, where k contains three entries, two of them unique):

    +---+---+
    |  k|  v|
    +---+---+
    |foo|  1|
    |bar|  2|
    |foo|  3|
    +---+---+
    

    With a Pandas dataframe:

    import pandas as pd
    p_df = pd.DataFrame([("foo", 1), ("bar", 2), ("foo", 3)], columns=("k", "v"))
    p_df['k'].unique()
    

    This returns an ndarray, i.e. array(['foo', 'bar'], dtype=object)

    You asked for a "pyspark dataframe alternative for pandas df['col'].unique()". Now, given the following Spark dataframe:

    s_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))
    

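    If you want the same result from Spark, i.e. an ndarray, use toPandas(). Note that this collects the whole dataframe to the driver, so it only suits data that fits in driver memory: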

    s_df.toPandas()['k'].unique()
    

    Alternatively, if you don't need an ndarray specifically and just want a list of the unique values of column k:

    s_df.select('k').distinct().rdd.map(lambda r: r[0]).collect()
    

    Finally, you can also use a list comprehension. collect() returns Row objects, so pull the k field out of each one:

    [i.k for i in s_df.select('k').distinct().collect()]