So I have a pyspark dataframe, it contains 12 rows and 50 columns. I want to drop the columns that contains 0 more than 4 rows.
However the answers in the above question are only for pandas. Is there a solution for pyspark dataframe?
In pyspark, you'll have to bring the count of zeros in every column into the driver using collect(). From memory wise this should not be a big problem, because you'll have one value per column. Try this,
from pyspark.sql import functions as F
tst= sqlContext.createDataFrame([(1,0,0),(1,0,4),(1,0,10),(2,1,90),(7,2,0),(0,3,11)],schema=['group','order','value'])
expr = [F.count(F.when(F.col(coln)==0,1)).alias(coln) for coln in tst.columns]
tst_cnt = tst.select(*expr).collect()[0].asDict()
#%%
sel_coln =[x for x in tst_cnt.keys() if tst_cnt[x]<=2]
tst_final = tst.select(sel_coln)
I think, in sql syntax, you can do it in subquery.