So I have this dataframe sheepDF:
--------------------------------------
| sheep_id | sheep_age | sheep_color |
--------------------------------------
| 1        | 1         | gray        |
| 2        | 2         | grey        |
| 3        | 2         |             |
| 4        | 3         | white       |
| 5        | 3         |             |
--------------------------------------
I want to group by sheep_age, but exclude the empty strings from the aggregated sheep_color. So the result should look like this:
--------------------------------------
| sheep_id | sheep_age | sheep_color |
--------------------------------------
| 1        | 1         | gray        |
| 2,3      | 2         | grey        |
| 4,5      | 3         | white       |
--------------------------------------
instead of this:
--------------------------------------
| sheep_id | sheep_age | sheep_color |
--------------------------------------
| 1        | 1         | gray        |
| 2,3      | 2         | grey,       |
| 4,5      | 3         | white,      |
--------------------------------------
I tried following this solution to solve my problem.
Here is my code:
def get: DataFrame = {
  sheepDF
    .select(
      $"sheep_id".as("sheep_id"),
      $"sheep_age".as("sheep_age"),
      $"sheep_color".as("sheep_color")
    )
    .groupBy(
      $"sheep_age"
    )
    .agg(
      concat_ws(",",
        collect_list(
          $"sheep_id"
        )).as("sheep_ids"),
      concat_ws(",",
        collect_list(
          when($"sheep_color" != "", $"sheep_color")
        )).as("sheep_colors")
    )
}
But I got this error:
type mismatch;
[error] found : Boolean
[error] required: org.apache.spark.sql.Column
[error] when($"sheep_color" != "",
[error] ^
Why does it tell me required: org.apache.spark.sql.Column? What am I missing? Shouldn't the when function take a Boolean?
What you need here is a column of booleans, but != returns a plain Scala Boolean, not a Column. Spark defines two equality operators for columns: equal to (===) and not equal to (=!=). Both of these return a column of booleans holding the result of comparing the two columns element by element.
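To see the difference in types, here is a minimal sketch (assuming a SparkSession is active and import spark.implicits._ is in scope so the $ syntax works):

import org.apache.spark.sql.Column

val plainScala: Boolean = $"sheep_color" != ""   // Any.!= compares the Column object itself to the String, so this is always true
val sparkExpr: Column   = $"sheep_color" =!= ""  // Column.=!= builds a row-wise comparison expression, which is what when() expects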
In short, change $"sheep_color" != "" to $"sheep_color" =!= "" to fix the error.
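For completeness, here is a sketch of the corrected aggregation (assuming the same sheepDF and spark.implicits._ are available). Note that when(...) without an otherwise(...) returns null for non-matching rows, and collect_list skips nulls, which is what keeps the empty colors out of the concatenated result:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{collect_list, concat_ws, when}

def get: DataFrame = {
  sheepDF
    .groupBy($"sheep_age")
    .agg(
      concat_ws(",", collect_list($"sheep_id")).as("sheep_ids"),
      // when(...) with no otherwise(...) yields null for empty colors,
      // and collect_list drops nulls, so they never reach concat_ws
      concat_ws(",", collect_list(when($"sheep_color" =!= "", $"sheep_color"))).as("sheep_colors")
    )
}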
More insight into why === is used instead of == can be found here: Difference between == and === in Scala, Spark