Tags: filter, pyspark, drop

Drop columns that contain certain values in a number of rows in PySpark


So I have a PySpark dataframe with 12 rows and 50 columns. I want to drop the columns that contain 0 in more than 4 rows.

However, the answers to the question referenced above are only for pandas. Is there a solution for a PySpark dataframe?


Solution

  • In PySpark, you'll have to bring the count of zeros in every column into the driver using collect(). Memory-wise this should not be a big problem, because you'll only have one value per column. Try this:

    from pyspark.sql import SparkSession, functions as F
    spark = SparkSession.builder.getOrCreate()
    tst = spark.createDataFrame([(1, 0, 0), (1, 0, 4), (1, 0, 10), (2, 1, 90), (7, 2, 0), (0, 3, 11)],
                                schema=['group', 'order', 'value'])
    # Count the rows in which each column equals 0; count() ignores the nulls that when() yields otherwise.
    expr = [F.count(F.when(F.col(coln) == 0, 1)).alias(coln) for coln in tst.columns]
    # Collect the single row of per-column zero counts to the driver.
    tst_cnt = tst.select(*expr).collect()[0].asDict()
    # Keep only the columns whose zero count is within the threshold (2 here; 4 for the question's 12-row dataframe).
    sel_coln = [x for x in tst_cnt.keys() if tst_cnt[x] <= 2]
    tst_final = tst.select(sel_coln)
    
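    With this toy data, order contains three zeros and is dropped, while group and value (at most two zeros each) are kept.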

    I think that in SQL syntax you could express the counting step with a subquery; a rough sketch of that idea follows.
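
    Below is a minimal, untested sketch of moving the counting into Spark SQL, using conditional aggregation over a temporary view rather than a literal subquery. It assumes the `spark` session and `tst` dataframe from the snippet above; the view name "tst" and the threshold of 2 are illustrative choices. The final column selection still happens in Python, since plain SQL cannot pick columns dynamically.

    # Hypothetical sketch: compute the per-column zero counts with Spark SQL.
    # Assumes `spark` and `tst` from the snippet above.
    tst.createOrReplaceTempView("tst")
    count_exprs = ", ".join(
        f"count(CASE WHEN {c} = 0 THEN 1 END) AS {c}" for c in tst.columns
    )
    zero_counts = spark.sql(f"SELECT {count_exprs} FROM tst").collect()[0].asDict()
    # Column selection still happens in Python; SQL alone cannot drop columns dynamically.
    tst_final = tst.select([c for c, n in zero_counts.items() if n <= 2])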