Tags: scala, apache-spark

How to extract value from json column in spark scala?


I have the following code:

class MyTest extends AnyFlatSpec with Matchers {

   ....

  it should "calculate" in {

    // assuming a case class like TestRecord(id: Int, customerId: String, msg: String);
    // the original elements were written testDf(...), which would self-reference the val being defined
    val testDf = Seq(
      TestRecord(1, "customer1", "Hi"),
      TestRecord(1, "customer2", "Hi")
    ).toDS().toDF()

    val out = MyClass.procOut(spark, testDf)
    out.count() should be(1)
    out.where(col("customer_id") === "customer1").first().getString(out.first().fieldIndex("json_col")) should be(?) // Here is my problem!

  }

}

My issue: out is a DataFrame, and the following:

out.where(col("customer_id") === "customer1").first().getString(out.first().fieldIndex("json_col"))

extracts the JSON column, which contains:

{
  "statistics": {
    "Group2": {
      "buy": 1
    }
  }
}

My goal is to get the value of buy, e.g. to assert that the value in the JSON is 1.

So I want to do something like:

out.where(col("customer_id") === "customer1").first().getString(out.first().fieldIndex("json_col")["statistics"]["Group2"]["buy"]) should be (1)

Obviously this is not the right syntax, but I can't find the right syntax to extract a specific value from the JSON.


Solution

    • json_tuple – extracts data from a JSON string and creates it as new columns.
    val jsonString = """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
    val data = Seq((1, jsonString))
    import spark.implicits._
    val df = data.toDF("id", "value")

    import org.apache.spark.sql.functions.{col, json_tuple}
    df.select(col("id"), json_tuple(col("value"), "Zipcode", "ZipCodeType", "City"))
      .toDF("id", "Zipcode", "ZipCodeType", "City")
      .show(false)
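
    This should print something like:

    +---+-------+-----------+-----------+
    |id |Zipcode|ZipCodeType|City       |
    +---+-------+-----------+-----------+
    |1  |704    |STANDARD   |PARC PARQUE|
    +---+-------+-----------+-----------+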
    
    • get_json_object – extracts a JSON element from a JSON string based on the specified JSON path.
    import org.apache.spark.sql.functions.get_json_object
    df.select(col("id"), get_json_object(col("value"), "$.ZipCodeType").as("ZipCodeType"))
      .show(false)
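
Applied back to the question: get_json_object with the JSON path $.statistics.Group2.buy pulls the nested value straight out of json_col. A minimal sketch of the assertion, assuming the column names (customer_id, json_col) from the test above; note that get_json_object returns a string, so compare against "1" or cast first:

    import org.apache.spark.sql.functions.{col, get_json_object}

    // filter to the customer, then extract the nested "buy" value via a JSON path
    val buy = out
      .where(col("customer_id") === "customer1")
      .select(get_json_object(col("json_col"), "$.statistics.Group2.buy").as("buy"))
      .first()
      .getString(0)

    buy should be("1") // get_json_object returns a string, not an int

If you prefer to assert against an integer, add .cast("int") to the extracted column and read it with getInt(0) instead.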