So, I have a parquet file with nested data like this, and I want to process it in Scala using Spark:
[
  {
    "sheep": {
      "id": 1,
      "data": {
        "age": 1,
        "price": 101
      }
    }
  },
  {
    "sheep": {
      "id": 2,
      "data": {
        "age": 2
      }
    }
  },
  {
    "sheep": {
      "id": 3
    }
  }
]
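For context, this is roughly how I load the file (the path and session setup here are placeholders, not my actual project values):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("sheep")
  .master("local[*]") // placeholder: run locally for this example
  .getOrCreate()
import spark.implicits._ // enables the $"..." column syntax below

// placeholder path; the real file lives elsewhere
val sheepDF: DataFrame = spark.read.parquet("data/sheep.parquet")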
I tried to select the data into a DataFrame and then write it out as a .csv file:
def get: DataFrame = {
  sheepDF
    .select(
      $"sheep.id".as("id"),
      $"sheep.data.age".as("age"),
      $"sheep.data.price".as("price")
    )
}
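For the .csv step I do something like this (a sketch; the output directory is a placeholder, and coalesce is only there to get a single file):

get
  .coalesce(1) // optional: collapse to one output file
  .write
  .option("header", "true")
  .csv("output/sheep_csv") // placeholder output directory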
The resulting file looks like this:
--------------------
| id | age | price |
--------------------
| 1  | 1   | 101   |
| 2 | 2 | |
| 3 | | |
--------------------
Instead, I want the output file to look like this:
--------------------
| id | age | price |
--------------------
| 1  | 1   | 101   |
| 2 | 2 | 0 |
| 3 | | 0 |
--------------------
I tried combining when with isNaN and isNull, as this article explains. Starting with just isNaN:
def get: DataFrame = {
  val priceCol = $"sheep.data.price"
  sheepDF
    .select(
      $"sheep.id".as("id"),
      $"sheep.data.age".as("age"),
      when(priceCol.isNaN, 0).otherwise(priceCol).as("price")
    )
}
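Adding the isNull check from the article as well, my combined attempt looked like this (a sketch of what I tried; it assumes the column resolves in the schema):

def get: DataFrame = {
  val priceCol = $"sheep.data.price"
  sheepDF
    .select(
      $"sheep.id".as("id"),
      $"sheep.data.age".as("age"),
      // treat both SQL NULL and NaN as a missing price
      when(priceCol.isNull || priceCol.isNaN, 0).otherwise(priceCol).as("price")
    )
}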
I also saw this solution here, but I can't figure out how to use it. First, since I am selecting with $"sheep.data.price" (using $), how do I pass that to the hasColumn function? And what do I refer to for the df argument of hasColumn? Should I be passing sheepDF?
Then, when I try to combine when with hasColumn, I get this error:
type mismatch;
found : Boolean
required: org.apache.spark.sql.Column
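If I read the error right, hasColumn returns a plain Scala Boolean, while when(...) expects a Column, so presumably it has to drive an ordinary if/else while the query is being built rather than appear inside the expression. My guess at the wiring, with the helper copied from that answer (and sheepDF passed as df):

import scala.util.Try
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit, when}

// shape of the linked helper: resolving the column either succeeds or throws
def hasColumn(df: DataFrame, path: String): Boolean = Try(df(path)).isSuccess

def get: DataFrame = {
  // evaluated once, in Scala, while building the query; not per row
  val priceCol: Column =
    if (hasColumn(sheepDF, "sheep.data.price"))
      when(col("sheep.data.price").isNull, 0).otherwise(col("sheep.data.price"))
    else
      lit(0)

  sheepDF.select(
    col("sheep.id").as("id"),
    col("sheep.data.age").as("age"),
    priceCol.as("price")
  )
}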
I also tried to see what $"sheep.data.price" returns when there is no data. I printed it like this:

(lit($"sheep.data.price".getClass.toString())).as("coupon"),

But it just returns class org.apache.spark.sql.ColumnName.
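I gather getClass only describes the unevaluated Column expression object, not the data behind it; to check whether the nested field actually exists, inspecting the schema seems more direct (a sketch):

// shows whether sheep.data.price is present in the file's schema
sheepDF.printSchema()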
I know there should be a simpler solution that avoids Try. Thank you!
Use col("column") with isNull:
import org.apache.spark.sql.functions.{col, when}

df.select(
    col("sheep.id").as("id"),
    col("sheep.data.age").as("age"),
    when(col("sheep.data.price").isNull, 0)
      .otherwise(col("sheep.data.price"))
      .as("price"))
  .show()
The result is:
+---+----+-----+
| id| age|price|
+---+----+-----+
| 1| 1| 101|
| 2| 2| 0|
| 3|null| 0|
+---+----+-----+
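If you would rather not repeat the column expression, na.fill is an alternative: select the raw columns, then fill nulls in just the price column (a sketch; the fill value 0 and the .show call are for illustration):

df.select(
    col("sheep.id").as("id"),
    col("sheep.data.age").as("age"),
    col("sheep.data.price").as("price"))
  .na.fill(0, Seq("price"))
  .show()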