Tags: apache-spark, apache-spark-sql, distinct-values

PySpark how do I count distinct in only one column and sort by another column in a DataFrame?


I have a DataFrame with duplicate rows in column A that have different values in column B.

Example for my data:

| Column A | Column B |
| -------- | -------- |
| APPLE    | RED      |
| APPLE    | GREEN    |
| GRAPE    | BLACK    |
| BANANA   | RED      |
| BANANA   | BLUE     |
| BANANA   | GREEN    |
| BANANA   | GREEN    |

I want to count the distinct values in column B, grouped and sorted by column A.

Expected data:

| Column A | Column B |
| -------- | -------- |
| APPLE    | 2        |
| GRAPE    | 1        |
| BANANA   | 3        |

Any pointers on how to approach this problem? Either PySpark or SQL can be used.


Solution

  • This should work (note: `count_distinct` was added in Spark 3.2; on older versions use `countDistinct`). Since you also want the result sorted by column A, add an `orderBy` at the end:

    from pyspark.sql.functions import col, count_distinct

    df \
        .groupBy('Column A') \
        .agg(count_distinct(col('Column B')).alias('Column B')) \
        .orderBy('Column A')