Tags: apache-spark, apache-spark-sql, distinct-values

PySpark how do I count distinct in only one column and sort by another column in a DataFrame?


I have a DataFrame with duplicate rows in column A that have different values in column B.

Example for my data:

| Column A | Column B |
| -------- | -------- |
| APPLE    | RED      |
| APPLE    | GREEN    |
| GRAPE    | BLACK    |
| BANANA   | RED      |
| BANANA   | BLUE     |
| BANANA   | GREEN    |
| BANANA   | GREEN    |

I want to count the distinct values in column B, grouped and sorted by column A.

Expected data:

| Column A | Column B |
| -------- | -------- |
| APPLE    | 2        |
| GRAPE    | 1        |
| BANANA   | 3        |

Any pointers on how to approach this problem? Either PySpark or SQL can be used.


Solution

  • This should work (note: `count_distinct` was added in Spark 3.2; on older versions use `countDistinct`). Since you also want the result sorted by column A, add an `orderBy` at the end:

    from pyspark.sql.functions import col, count_distinct

    df \
        .groupBy('Column A') \
        .agg(count_distinct(col('Column B')).alias('Column B')) \
        .orderBy('Column A')