How can I perform the same functionality as this pandas command with a PySpark DataFrame or RDD?
df.drop(df.std()[(df.std() == 0)].index, axis=1)
For details on what this command does, refer to: How to drop columns which have same values in all rows via pandas or spark dataframe?
Note: the file is too big to use df.toPandas().
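For a reproducible example, assume a DataFrame like the one implied by the outputs below; the constant column data2 is hypothetical, added so there is something to drop:
# Hypothetical sample data consistent with the outputs below; the all-zero
# column `data2` is an assumption, and `spark` is an active SparkSession.
df = spark.createDataFrame(
    [(345, 0, "name1", 3, 0), (12, 1, "name2", 2, 0), (2, 5, "name6", 7, 0)],
    ("id", "index", "name", "data1", "data2"))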
In general you can use countDistinct:
from pyspark.sql.functions import countDistinct

# Count the distinct values in every column and collect the result as a dict
cnts = (df
    .select([countDistinct(c).alias(c) for c in df.columns])
    .first()
    .asDict())

# Keep only the columns with more than one distinct value
df.select(*[k for (k, v) in cnts.items() if v > 1])
## +---+-----+-----+-----+
## | id|index| name|data1|
## +---+-----+-----+-----+
## |345| 0|name1| 3|
## | 12| 1|name2| 2|
## | 2| 5|name6| 7|
## +---+-----+-----+-----+
This can be expensive on data with high cardinality, but it can handle non-numeric columns.
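If exact distinct counts are too slow, a sketch using approx_count_distinct (in pyspark.sql.functions since Spark 2.1; the rsd value here is an arbitrary choice) trades a little accuracy for speed:
from pyspark.sql.functions import approx_count_distinct

# Approximate counts avoid the cost of exact distinct aggregation on
# high-cardinality data; rsd is the maximum relative standard deviation.
approx_cnts = (df
    .select([approx_count_distinct(c, rsd=0.05).alias(c) for c in df.columns])
    .first()
    .asDict())

df.select(*[k for (k, v) in approx_cnts.items() if v > 1])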
You can use the same approach to filter with standard deviations:
from pyspark.sql.functions import stddev

# Compute the standard deviation of every column; for non-numeric columns
# the implicit cast to double yields null, so the value comes back as None
stddevs = df.select(*[stddev(c).alias(c) for c in df.columns]).first().asDict()

# Keep non-numeric columns (None) and numeric columns with non-zero stddev
df.select(*[k for (k, v) in stddevs.items() if v is None or v != 0.0])
## +---+-----+-----+-----+
## | id|index| name|data1|
## +---+-----+-----+-----+
## |345| 0|name1| 3|
## | 12| 1|name2| 2|
## | 2| 5|name6| 7|
## +---+-----+-----+-----+
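If you need this in more than one place, the stddev filter wraps naturally into a helper; a minimal sketch (the name drop_constant_columns is my own):
from pyspark.sql.functions import stddev

def drop_constant_columns(df):
    # Drop numeric columns whose standard deviation is exactly zero;
    # non-numeric columns (stddev is None) are kept.
    stddevs = df.select(*[stddev(c).alias(c) for c in df.columns]).first().asDict()
    return df.select(*[k for (k, v) in stddevs.items() if v is None or v != 0.0])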