
Why am I getting a type mismatch when filtering rows for a specific aggregate with Spark SQL?


So I have this dataframe sheepDF:

--------------------------------------
| sheep_id | sheep_age | sheep_color |
--------------------------------------
|    1     |     1     |    gray     |
|    2     |     2     |    grey     |
|    3     |     2     |             |
|    4     |     3     |    white    |
|    5     |     3     |             |
--------------------------------------

I want to group by sheep_age, but exclude empty strings from the aggregate. So the result should look like this:

----------------------------------------
| sheep_ids | sheep_age | sheep_colors |
----------------------------------------
|     1     |     1     |     gray     |
|    2,3    |     2     |     grey     |
|    4,5    |     3     |    white     |
----------------------------------------

instead of this:

----------------------------------------
| sheep_ids | sheep_age | sheep_colors |
----------------------------------------
|     1     |     1     |     gray     |
|    2,3    |     2     |    grey,     |
|    4,5    |     3     |    white,    |
----------------------------------------

I tried following this solution to solve my problem.

Here is my code:

def get: DataFrame = {
  sheepDF
    .select(
      $"sheep_id".as("sheep_id"),
      $"sheep_age".as("sheep_age"),
      $"sheep_color".as("sheep_color")
    )
    .groupBy(
      $"sheep_age"
    )
    .agg(
      concat_ws(",",
        collect_list(
          $"sheep_id"
        ))
        .as("sheep_ids"),
      concat_ws(",",
        collect_list(
          when($"sheep_color" != "", $"sheep_color")
        ))
        .as("sheep_colors")
    )
}

But I got this error:

type mismatch;
[error]  found   : Boolean
[error]  required: org.apache.spark.sql.Column
[error]             when($"sheep_color" != "",
[error]                                      ^

Why does it tell me required: org.apache.spark.sql.Column? What am I missing? Shouldn't when take a Boolean?


Solution

  • What you want here is a column of booleans, but != returns a plain Boolean, not a Column. In Scala, != is universal equality, defined on Any, so it compiles for any pair of operands and always yields a Boolean. Spark therefore defines its own equality operators for columns: equal to === and not equal to =!=. Both return a Column of booleans, the result of element-wise comparison of the values being compared, which is what when expects.

    In short, change $"sheep_color" != "" to $"sheep_color" =!= "" to solve the error.
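    With that change, the aggregation compiles. It also behaves as desired because when without an otherwise evaluates to null for non-matching rows, and collect_list drops nulls, so the empty strings never reach concat_ws. A sketch of the corrected aggregation (same DataFrame and column names as above):

    ```scala
    import org.apache.spark.sql.functions.{collect_list, concat_ws, when}

    sheepDF
      .groupBy($"sheep_age")
      .agg(
        concat_ws(",", collect_list($"sheep_id")).as("sheep_ids"),
        // when(...) with no otherwise yields null for empty strings,
        // and collect_list silently skips nulls
        concat_ws(",",
          collect_list(when($"sheep_color" =!= "", $"sheep_color"))
        ).as("sheep_colors")
      )
    ```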


    Some more insights of why === is used instead of == can be found here: Difference between == and === in Scala, Spark
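    The root cause can be seen without Spark at all: Scala's built-in != is defined on Any and always produces a Boolean, so a library that wants comparisons to build an expression object must introduce its own operator. A minimal sketch with a hypothetical Column class (not Spark's real one) illustrating the difference:

    ```scala
    // Hypothetical stand-in for Spark's Column, for illustration only
    case class Column(name: String) {
      // Spark-style operator: returns a new Column expression, not a Boolean
      def =!=(other: Any): Column = Column(s"($name <> $other)")
    }

    val c = Column("sheep_color")

    // Universal equality: compiles for any operands, result is a plain Boolean
    val bool: Boolean = c != ""

    // Library-defined operator: result is a Column expression
    val expr: Column = c =!= ""
    ```

    Because c is not the string "", bool is simply true; only =!= produces something a function like when could accept.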