
Using sum in withColumn (PySpark / Python)


I need your help with a small issue, which I actually solved manually, but I would like to do it "right".

Data:

customer_id   gender
abc           m
def           w

etc.

Now it gets aggregated the following way:

gender_count = data.groupBy('gender').agg(F.countDistinct('customer_id').alias('amount'))

gender_count:

gender   amount
m        4
w        6
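
For anyone who wants to reproduce this step, here is a minimal self-contained sketch, assuming a local SparkSession and the usual import pyspark.sql.functions as F (the two sample rows are made up, so the counts will not match the example output above):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy data in the shape described above (only two customers here,
    # so the counts differ from the example output).
    data = spark.createDataFrame(
        [('abc', 'm'), ('def', 'w')],
        ['customer_id', 'gender'],
    )

    # Count distinct customers per gender.
    gender_count = data.groupBy('gender').agg(
        F.countDistinct('customer_id').alias('amount')
    )
    gender_count.show()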

Now I would like to create a new column with the total number of customers in order to compute the share of each gender. However, I could not find a function that works with withColumn. So what I do is sum up the number of customers beforehand and insert it as a literal value:

gender_count = gender_count.withColumn('total', F.lit(10)).withColumn('share', (F.col('amount') / F.col('total')))

Result:

gender   amount   total   share
m        4        10      0.4
w        6        10      0.6

Do you have any idea how I could replace the F.lit(10)? That would save me one manual step and remove a potential source of error.

Thank you!


Solution

  • You can use count() on the original DataFrame:

    total_items = data.count()
    
    gender_count = gender_count.withColumn('total', F.lit(total_items)).withColumn('share', (F.col('amount') / F.col('total')))
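
Note that data.count() counts rows, not distinct customers, so if customer_id can repeat in the raw data you would want data.select('customer_id').distinct().count() instead. Alternatively, you can avoid the separate driver-side step entirely by summing the aggregated amounts with a window over the whole DataFrame. A sketch, using an unpartitioned window (this moves all rows of the small aggregated frame into one partition, which is fine here because it only has one row per gender):

    from pyspark.sql import Window
    import pyspark.sql.functions as F

    # An empty partitionBy() spans the entire DataFrame,
    # so F.sum('amount') yields the grand total on every row.
    w = Window.partitionBy()

    gender_count = (gender_count
        .withColumn('total', F.sum('amount').over(w))
        .withColumn('share', F.col('amount') / F.col('total')))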