I have a df derived from clustering that looks like this:
Cluster | Variable 1 | Variable 2 |
---|---|---|
0 | 334 | 32 |
0 | 0 | 45 |
3 | 453 | 0 |
3 | 320 | 0 |
0 | 0 | 28 |
1 | 467 | 49 |
3 | 324 | 16 |
1 | 58 | 2 |
And I'm trying to achieve the following result for each cluster and each variable:
Variable 1
Cluster | % of 0 | % of vals != 0 | Count of vals != 0 | Sum of values | % universe |
---|---|---|---|---|---|
0 | 67 | 33 | 1 | 334 | 17 |
1 | 0 | 100 | 2 | 525 | 27 |
3 | 0 | 100 | 3 | 1097 | 56 |
Variable 2
Cluster | % of 0 | % of vals != 0 | Count of vals != 0 | Sum of values | % universe |
---|---|---|---|---|---|
0 | 0 | 100 | 3 | 105 | 61 |
1 | 0 | 100 | 2 | 51 | 30 |
3 | 67 | 33 | 1 | 16 | 9 |
Note: % universe is each cluster's share of the total sum of the variable across all clusters. For Variable 1 that total is 334 + 525 + 1097 = 1956 (this is 100%, so 334 is 17% of it).
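For reference, that total is just the overall sum of the column; assuming the same df as above, a minimal snippet like this prints it:

```python
from pyspark.sql.functions import sum as sum_

# overall sum of Variable 1 across all clusters -> 1956
df.agg(sum_('Variable 1')).show()
```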
I'm in the process of learning PySpark and I'm struggling with the syntax. This is the code I'm trying, but I'm at a loss because I don't know how to manage the filtering so it iterates over each variable and each cluster:
```python
for i in list_of_variables:
    print(i)
    df.groupBy('Cluster').agg(
        (count(col(i) == 0) / df.filter(col('Cluster') == 0).count() * 100).alias('% of 0'),
        (count(col(i) != 0) / df.filter(col('Cluster') == 0).count() * 100).alias('% of vals diff than 0'),
        ...
```
I would be very grateful for any ideas that could shed some light on how to achieve this. Have an awesome day!
Maybe you could try something like this to get the counts part:
```python
from pyspark.sql.functions import col, count

for i in list_of_variables:
    print(i)
    # keep only the non-zero rows, then count them per cluster
    df.filter(col(i) != 0).groupBy('Cluster').agg(
        count('*').alias('Count_vals_dif_0')
    ).show()
```
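And to build the full table in one pass, here is a rough sketch of how you could extend it, using conditional aggregation with `when` so you don't need to filter inside the loop. It's untested against your exact data, so treat it as a starting point; the only name it assumes is your `list_of_variables` from the question:

```python
from pyspark.sql.functions import col, count, round as round_, sum as sum_, when

for i in list_of_variables:
    print(i)
    # universe total: the sum of this variable over the whole DataFrame (1956 for Variable 1)
    total = df.agg(sum_(col(i))).collect()[0][0]
    df.groupBy('Cluster').agg(
        # when() yields null where the condition fails, and count() ignores nulls,
        # so count(when(...)) is a conditional count within each cluster
        round_(count(when(col(i) == 0, 1)) / count('*') * 100).alias('% of 0'),
        round_(count(when(col(i) != 0, 1)) / count('*') * 100).alias('% of vals != 0'),
        count(when(col(i) != 0, 1)).alias('Count of vals != 0'),
        sum_(col(i)).alias('Sum of values'),
        round_(sum_(col(i)) / total * 100).alias('% universe'),
    ).orderBy('Cluster').show()
```

Computing `total` once per variable before the groupBy gives you the 100% denominator for % universe, and `count('*')` inside the agg counts all rows of the cluster, so the two percentage columns always add up to 100.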