Tags: json, apache-spark, dataframe, rdd

Fit a JSON string to a DataFrame using a schema


I have a schema that looks like this:

StructType(StructField(keys,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))

I have a JSON string that I need to convert into a DataFrame matching the above schema:

"{"keys" : [2.0, 1.0]}"

How do I proceed to get a DataFrame out of this string that matches my schema? These are the steps I have tried in a Scala notebook:

val rddData2 = sc.parallelize("""{"keys" : [1.0 , 2.0] }""" :: Nil)
val in = session.read.schema(schema).json(rddData2)
in.show

This is the output I get:

+-----------+
|keys       |
+-----------+
|null       |
+-----------+

Solution

  • If you have a JSON string as

    val jsonString = """{"keys" : [2.0, 1.0]}"""
    

    then you can create a DataFrame without a schema as

    val jsonRdd = sc.parallelize(Seq(jsonString))
    val df = session.read.json(jsonRdd)
    

    which should give you

    +----------+
    |keys      |
    +----------+
    |[2.0, 1.0]|
    +----------+
    

    with schema

    root
     |-- keys: array (nullable = true)
     |    |-- element: double (containsNull = true)
    
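    Incidentally, the reason your original read showed null is that Spark's JSON datasource can parse the array into an ArrayType but does not know how to build a VectorUDT directly. If you want this read without schema inference, here is a minimal sketch (the name arrayDf is illustrative) using an explicit array schema:

    import org.apache.spark.sql.types._

    // an explicit array<double> schema; the JSON reader can parse this,
    // unlike the VectorUDT in the original schema
    val arraySchema = StructType(Seq(
      StructField("keys", ArrayType(DoubleType, containsNull = true), nullable = true)
    ))
    val arrayDf = session.read.schema(arraySchema).json(jsonRdd)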

    Now, if you want to convert the array column created by default into a Vector, you need a udf function as

    import org.apache.spark.sql.functions._
    import org.apache.spark.ml.linalg.Vectors
    // converts the array<double> column into an ml dense Vector
    val vectorUdf = udf((array: Seq[Double]) => Vectors.dense(array.toArray))
    

    and call the udf function using .withColumn as

    df.withColumn("keys", vectorUdf(col("keys")))
    

    You should get a DataFrame with the schema

    root
     |-- keys: vector (nullable = true)
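
    As a side note, recent Spark versions (3.1 and later) ship a built-in helper that replaces the hand-written udf; a minimal sketch, assuming that version is available (vectorDf is an illustrative name):

    import org.apache.spark.ml.functions.array_to_vector
    import org.apache.spark.sql.functions.col

    // Spark 3.1+: built-in array<double> -> Vector conversion, no udf needed
    val vectorDf = df.withColumn("keys", array_to_vector(col("keys")))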