Tags: python, pandas, pyspark, apache-spark-sql, rdd

Spark DataFrame/RDD equivalent to the pandas command given in the description?


How can I achieve the same functionality as this pandas command using a PySpark DataFrame or RDD?

df.drop(df.std()[(df.std() == 0)].index, axis=1)

For details on what this command does (it drops every column whose standard deviation is zero, i.e., every column that holds a single repeated value), see: How to drop columns which have same values in all rows via pandas or spark dataframe?
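
To illustrate, here is a minimal pandas sketch with hypothetical data (data2 stands in for a constant column; the values are made up for the example):

import pandas as pd

# Hypothetical frame: data2 holds the same value in every row
df = pd.DataFrame({"id": [345, 12, 2], "data1": [3, 2, 7], "data2": [0, 0, 0]})

# std() is 0 exactly for the constant columns, so their labels form
# the index that drop() removes along axis=1 (drops data2 here)
df.drop(df.std()[(df.std() == 0)].index, axis=1)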

Note:

The file is too big to use df.toPandas().


Solution
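
For reproducibility, here is a minimal sketch of input data consistent with the outputs below; it is reconstructed rather than taken from the original question, and data2 is a hypothetical constant column standing in for whatever needs to be dropped:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# data2 holds the same value in every row, so both approaches below
# should drop it
df = spark.createDataFrame(
    [(345, 0, "name1", 3, 0), (12, 1, "name2", 2, 0), (2, 5, "name6", 7, 0)],
    ["id", "index", "name", "data1", "data2"])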

  • In general, you can use countDistinct:

    from pyspark.sql.functions import countDistinct

    # Count the distinct values in every column in a single pass
    cnts = (df
        .select([countDistinct(c).alias(c) for c in df.columns])
        .first()
        .asDict())

    # Keep only the columns with more than one distinct value
    df.select(*[k for (k, v) in cnts.items() if v > 1])
    
    ## +---+-----+-----+-----+
    ## | id|index| name|data1|
    ## +---+-----+-----+-----+
    ## |345|    0|name1|    3|
    ## | 12|    1|name2|    2|
    ## |  2|    5|name6|    7|
    ## +---+-----+-----+-----+
    

    This won't scale well to high-cardinality data, but it can handle non-numeric columns.
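
    If exact counts are too expensive, a cheaper option (a sketch, assuming Spark 2.1+, where approx_count_distinct is available) is to approximate them; since the test only distinguishes one distinct value from more than one, the approximation error should not matter in practice:

    from pyspark.sql.functions import approx_count_distinct

    # Approximate distinct counts (HyperLogLog++) in a single pass;
    # much cheaper than countDistinct on high-cardinality columns
    approx_cnts = (df
        .select([approx_count_distinct(c).alias(c) for c in df.columns])
        .first()
        .asDict())

    df.select(*[k for (k, v) in approx_cnts.items() if v > 1])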

    You can use the same approach to filter on the standard deviation:

    from pyspark.sql.functions import stddev

    # Compute the standard deviation of every column in a single pass
    stddevs = df.select(*[stddev(c).alias(c) for c in df.columns]).first().asDict()

    # Keep columns whose stddev is non-zero, plus those where it is
    # None (non-numeric columns, for which the aggregate returns null)
    df.select(*[k for (k, v) in stddevs.items() if v is None or v != 0.0])
    
    ## +---+-----+-----+-----+
    ## | id|index| name|data1|
    ## +---+-----+-----+-----+
    ## |345|    0|name1|    3|
    ## | 12|    1|name2|    2|
    ## |  2|    5|name6|    7|
    ## +---+-----+-----+-----+
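
    Note that stddev is only defined for numeric data; the v is None check above is what keeps the non-numeric columns. If you would rather make that explicit, here is a sketch (my own addition, not part of the original answer, using df.dtypes to pick out the numeric columns) that only tests numeric columns and keeps everything else unconditionally:

    from pyspark.sql.functions import stddev

    # Numeric columns according to their Spark SQL type strings
    numeric = [c for c, t in df.dtypes
               if t in ("tinyint", "smallint", "int", "bigint", "float", "double")
               or t.startswith("decimal")]

    # A column is constant when its standard deviation is exactly 0.0
    # (with a single row, stddev is null, so such columns are kept)
    stddevs = df.select(*[stddev(c).alias(c) for c in numeric]).first().asDict()
    constant = {k for (k, v) in stddevs.items() if v == 0.0}

    df.select(*[c for c in df.columns if c not in constant])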