apache-spark, pyspark, apache-spark-sql

Calculating percentage of total count for groupBy using pyspark


I have the following PySpark code, which produces a table of the distinct values of a column and their counts. I want to add another column showing what percentage of the total count each row represents. How do I do that?

from pyspark.sql.functions import desc

difrgns = (
    df1
    .groupBy("column_name")
    .count()
    .sort(desc("count"))
)
difrgns.show()

Solution

  • An example as an alternative for anyone not comfortable with windowing, which, as the comment alludes to, is the better way to go:

    # Running in Databricks, not all stuff required; the SparkSession is
    # obtained explicitly so the example also runs outside a notebook.
    from pyspark.sql import Row
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    
    spark = SparkSession.builder.getOrCreate()
    
    data = [
        ("A", "X", 2, 100),
        ("A", "X", 7, 100),
        ("B", "X", 10, 100),
        ("C", "X", 1, 100),
        ("D", "X", 50, 100),
        ("E", "X", 30, 100),
    ]
    
    # Parallelize the tuples and map them to Rows; Spark infers the schema
    rdd = spark.sparkContext.parallelize(data)
    rows = rdd.map(lambda x: Row(c1=x[0], c2=x[1], val1=int(x[2]), val2=int(x[3])))
    df = spark.createDataFrame(rows)
    
    tot = df.count()  # total row count, used as the percentage denominator
    
    (
        df
        .groupBy("c1")
        .count()
        .withColumnRenamed("count", "cnt_per_group")
        .withColumn("perc_of_count_total", (F.col("cnt_per_group") / tot) * 100)
        .show()
    )
    

    Which returns:

    +---+-------------+-------------------+
    | c1|cnt_per_group|perc_of_count_total|
    +---+-------------+-------------------+
    |  E|            1| 16.666666666666664|
    |  B|            1| 16.666666666666664|
    |  D|            1| 16.666666666666664|
    |  C|            1| 16.666666666666664|
    |  A|            2|  33.33333333333333|
    +---+-------------+-------------------+
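
    If the long decimals bother you, the percentage column can be rounded for display. A minimal tweak, assuming the same df and tot as above:

    (
        df
        .groupBy("c1")
        .count()
        .withColumnRenamed("count", "cnt_per_group")
        .withColumn("perc_of_count_total", F.round(F.col("cnt_per_group") / tot * 100, 2))
        .show()
    )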
    

    I focus on Scala, and this sort of thing seems easier there. That said, the solution suggested in the comments uses a Window function, which is what I would do in Scala with over(); a sketch of that approach follows.
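
    The following is a minimal sketch of that Window-based variant, assuming the same df with a c1 column as above. An empty partitionBy() puts every row in a single partition, so the windowed sum is the grand total and no separate count() action is needed (Spark will warn about the single partition on large data):

    from pyspark.sql import Window
    import pyspark.sql.functions as F

    w = Window.partitionBy()  # one window spanning the whole DataFrame

    (
        df
        .groupBy("c1")
        .count()
        .withColumnRenamed("count", "cnt_per_group")
        .withColumn(
            "perc_of_count_total",
            F.col("cnt_per_group") / F.sum("cnt_per_group").over(w) * 100,
        )
        .show()
    )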