I have a DataFrame with duplicate rows in column A that have different values in column B.
Example of my data:
| Column A | Column B |
| -------- | -------- |
| APPLE | RED |
| APPLE | GREEN |
| GRAPE | BLACK |
| BANANA | RED |
| BANANA | BLUE |
| BANANA | GREEN |
| BANANA | GREEN |
I want to count the distinct values in column B, grouped by column A (and sorted by column A).
Expected data:
| Column A | Column B |
| -------- | -------- |
| APPLE | 2 |
| GRAPE | 1 |
| BANANA | 3 |
Any pointers on how to approach this problem? Either PySpark or SQL can be used.
This should work (note that `count_distinct` needs to be imported from `pyspark.sql.functions`; on Spark versions before 3.2 it is called `countDistinct`):

from pyspark.sql.functions import col, count_distinct

df \
    .groupBy('Column A') \
    .agg(count_distinct(col('Column B')).alias('Column B')) \
    .orderBy('Column A')

The trailing `.orderBy('Column A')` covers the sorting part of your question; drop it if you don't care about the output order.
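Since you mentioned SQL is also an option, here is a sketch of the equivalent query. It assumes the DataFrame has been registered as a temporary view named `fruits` (a hypothetical name), e.g. via `df.createOrReplaceTempView('fruits')`:

```sql
SELECT `Column A`,
       COUNT(DISTINCT `Column B`) AS `Column B`
FROM fruits
GROUP BY `Column A`
ORDER BY `Column A`;
```

The backticks are needed because the column names contain spaces; rename the columns if you want to avoid the quoting.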