Search code examples
scaladataframeapache-sparkparquet

Check whether nested data exist on parquet scala spark or not?


So, I have a parquet file with nested data like this. I want to process in scala using spark:

[
  {
    "sheep": {
      "id": 1,
      "data": {
        "age": 1,
        "price": 101
      }
    }
  },
  {
    "sheep": {
      "id": 2,
      "data": {
        "age":2
      }
    }
  },
  {
    "sheep": {
      "id": 3,
    }
  }
]

I tried to select the data in DataFrame, and then convert it to .csv file

def get: DataFrame = {
  sheepDF
    .select(
      $"sheep.id".as("id"),
      $"sheep.data.age".as("age"),
      $"sheep.data.price".as("price")
    )
}

Then, it will generate a file like this:

--------------------
| id | age | price |
--------------------
| 1  |  1  |  123  |
| 2  |  2  |       |
| 3  |     |       |
--------------------

I want the output file like this:

--------------------
| id | age | price |
--------------------
| 1  |  1  |  123  |
| 2  |  2  |   0   |
| 3  |     |   0   |
--------------------

I've combine when using isNaN and isNull like this article explain.

def get: DataFrame = {
  val priceCol = $"sheep.data.price"

  sheepDF
    .select(
      $"sheep.id".as("id"),
      $"sheep.data.age".as("age"),
      when(priceCol.isNaN, 0).otherwise(priceCol).as("price")
    )
}

I also see this solution here, but I can't find how to use it. Firstly, since I am using $"sheep.data.price" with $, how can I pass to the hasColumn function? And how do I refer to the df argument on the hasColumn function? Should I be passing sheepDF?

Then, when I try to combine when with hasColumn I got this error:

type mismatch;
 found   : Boolean
 required: org.apache.spark.sql.Column

I also try to read what this $"sheep.data.price" returns when there is no data found. I try to print it like this:

(lit($"sheep.data.price".getClass.toString())).as("coupon"),

But, it returns class org.apache.spark.sql.ColumnName

I know there should be a simpler solution without Try. Thank you!


Solution

  • Use col("column").

    df.select(
        col("sheep.id").as("id"),
        col("sheep.data.age").as("age"), 
        when(col("sheep.data.price").isNull, 0).otherwise(col("sheep.data.price")).as("price")).show
    

    The result is:

    +---+----+-----+
    | id| age|price|
    +---+----+-----+
    |  1|   1|  101|
    |  2|   2|    0|
    |  3|null|    0|
    +---+----+-----+