I need your help in a small issue, which I actually solved in a manual way, but I would like to get it "right"
Data:
customer_id | gender |
---|---|
abc | m |
def | w |
etc.
Now it gets aggreaget the following way:
gender_count = data.groupBy('gender').agg(F.countDistinct('customer_id').alias('amount'))
gender_count:
gender | amount |
---|---|
m | 4 |
w | 6 |
Now, I would like to creat a new column with the total number of customers in order to compute the share of the individual genders. However, I could not find a function that works with "withColumn". So what I do is to sum up the number of customers before and insert it as a literal value:
gender_count = gender_count.withColumn('total', F.lit(10)).withColumn('share', (F.col('amount') / F.col('total')))
Result:
gender | amount | total | share |
---|---|---|---|
m | 4 | 10 | 0.4 |
w | 6 | 10 | 0.6 |
Do you have any idea how i could replace the F.lit(10)? That would save me one manual step and a potential source of error.
Thank you!
You can use the count() on the dataframe.
total_items = data.count()
gender_count = gender_count.withColumn('total', F.lit(total_items)).withColumn('share', (F.col('amount') / F.col('total')))