Tags: filter, pyspark, drop

Drop columns that contain certain values in a number of rows in PySpark


So I have a PySpark dataframe with 12 rows and 50 columns. I want to drop the columns that contain 0 in more than 4 rows.

However, the answers to the question referenced above are only for pandas. Is there a solution for a PySpark dataframe?


Solution

  • In PySpark, you'll have to bring the count of zeros in every column into the driver using collect(). Memory-wise this should not be a big problem, because you'll only have one value per column. Try this:

    from pyspark.sql import SparkSession, functions as F
    spark = SparkSession.builder.getOrCreate()
    tst = spark.createDataFrame([(1, 0, 0), (1, 0, 4), (1, 0, 10), (2, 1, 90), (7, 2, 0), (0, 3, 11)],
                                schema=['group', 'order', 'value'])
    # Count the rows in which each column equals 0; count() ignores the nulls that when() yields otherwise.
    expr = [F.count(F.when(F.col(coln) == 0, 1)).alias(coln) for coln in tst.columns]
    # Collect the single row of per-column zero counts to the driver.
    tst_cnt = tst.select(*expr).collect()[0].asDict()
    # Keep only the columns whose zero count is within the threshold (2 here; 4 for the question's 12-row dataframe).
    sel_coln = [x for x in tst_cnt.keys() if tst_cnt[x] <= 2]
    tst_final = tst.select(sel_coln)
    
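    With this toy data, order contains three zeros and is dropped, while group and value (at most two zeros each) are kept.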

    I think that in SQL syntax you could express the counting step with a subquery; a rough sketch of that idea follows.
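
    Below is a minimal, untested sketch of moving the counting into Spark SQL, using conditional aggregation over a temporary view rather than a literal subquery. It assumes the `spark` session and `tst` dataframe from the snippet above; the view name "tst" and the threshold of 2 are illustrative choices. The final column selection still happens in Python, since plain SQL cannot pick columns dynamically.

    # Hypothetical sketch: compute the per-column zero counts with Spark SQL.
    # Assumes `spark` and `tst` from the snippet above.
    tst.createOrReplaceTempView("tst")
    count_exprs = ", ".join(
        f"count(CASE WHEN {c} = 0 THEN 1 END) AS {c}" for c in tst.columns
    )
    zero_counts = spark.sql(f"SELECT {count_exprs} FROM tst").collect()[0].asDict()
    # Column selection still happens in Python; SQL alone cannot drop columns dynamically.
    tst_final = tst.select([c for c, n in zero_counts.items() if n <= 2])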