python, pandas, pyspark, data-analysis, crosstab

pyspark crosstab with percentages


I have a DataFrame produced by applying crosstab in PySpark; example below:

id      A  B  C
cssdsd  0  1  0
sdsdsd  1  1  1
ssdssd  1  0  0
xxxxxx  0  0  0
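
For context, a crosstab like this comes from DataFrame.crosstab; the sample input below is illustrative only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative long-format input: one row per (id, value) observation.
df_raw = spark.createDataFrame(
    [("cssdsd", "B"), ("sdsdsd", "A"), ("sdsdsd", "B"),
     ("sdsdsd", "C"), ("ssdssd", "A")],
    ["id", "list"],
)

# DataFrame.crosstab names its first output column "id_list",
# so rename it back to "id".
df = df_raw.crosstab("id", "list").withColumnRenamed("id_list", "id")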

Instead of 0s and 1s, I want the row percentages. I can easily get that in pandas using the crosstab function:

pd.crosstab(df.index,df.list, normalize='index')

How can I get this in PySpark?


Solution

  • After getting the crosstab result, which is df in the code below, take the sum of all columns except id and then divide each column by that sum:

    from pyspark.sql import functions as F

    # All crosstab columns except the id column.
    cols = [c for c in df.columns if c != 'id']

    # Add a row-wise sum of those columns, then divide each column by it;
    # coalesce(..., 0) maps the null produced by all-zero rows back to 0.
    out = (df.withColumn("SumCols", F.expr('+'.join(cols)))
             .select("id", *[F.coalesce(F.round(F.col(c) / F.col("SumCols"), 2),
                                        F.lit(0)).alias(c)
                             for c in cols]))
    

    out.show()
    
    +------+----+----+----+
    |    id|   A|   B|   C|
    +------+----+----+----+
    |cssdsd| 0.0| 1.0| 0.0|
    |sdsdsd|0.33|0.33|0.33|
    |ssdssd| 1.0| 0.0| 0.0|
    |xxxxxx| 0.0| 0.0| 0.0|
    +------+----+----+----+
    

    Note that you can multiply by 100 inside the select loop if you want percentages rather than fractions, as sketched below.
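
    For example, scaling by 100 before rounding gives values in the 0–100 range (reusing df, cols and F from above):

    pct = (df.withColumn("SumCols", F.expr('+'.join(cols)))
             .select("id", *[F.coalesce(F.round(F.col(c) / F.col("SumCols") * 100, 2),
                                        F.lit(0)).alias(c)
                             for c in cols]))
    pct.show()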