I have an RDD[Map[String, String]] and need to convert it to a DataFrame
so that I can save the data in a Parquet file
where the map keys are the column names.
For example:
val inputRdf = spark.sparkContext.parallelize(List(Map("city" -> "", "ip" -> "", "source" -> "PlayStore","createdDate"->"2020-04-21"),
Map("city" -> "delhi", "ip" -> "", "source" -> "PlayStore","createdDate"->"2020-04-21"),
Map("city" -> "", "ip" -> "", "source" -> "PlayStore","createdDate"->"2020-04-22")))
City | ip
Delhi| 1.234
Here is some guidance to resolve your problem:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object MapToDfParquet {
  val spark = SparkSession
    .builder()
    .appName("MapToDfParquet")
    .master("local[*]") // assumption: local run, adjust for your cluster
    .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "MapToDfParquet") // To silence Metrics warning
    .getOrCreate()

  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext

  def main(args: Array[String]): Unit = {
    Logger.getRootLogger.setLevel(Level.ERROR)
    try {
      import spark.implicits._

      // Maps with up to four entries keep insertion order, so the values line up with the column names below
      val data = Seq(Map("city" -> "delhi", "ip" -> "", "source" -> "PlayStore", "createdDate" -> "2020-04-21"),
        Map("city" -> "", "ip" -> "", "source" -> "PlayStore", "createdDate" -> "2020-04-22"))
        .map(seq => seq.values.mkString(","))

      val df = sc.parallelize(data)
        .map(str => str.split(",", -1)) // limit -1 keeps trailing empty strings
        .map(arr => (arr(0), arr(1), arr(2), arr(3)))
        .toDF("city", "ip", "source", "createdDate")

      df.show(truncate = false)

      // by default Spark writes parquet with snappy compression;
      // we change this behavior and save as parquet uncompressed
      df.write
        .mode("overwrite")
        .option("compression", "none")
        .parquet("/tmp/map_to_df_parquet") // placeholder output path, adjust as needed

      // To have the opportunity to view the web console of Spark: http://localhost:4040/
      println("Type whatever to the console to exit......")
      scala.io.StdIn.readLine()
    } finally {
      sc.stop()
      println("SparkContext stopped")
      spark.stop()
      println("SparkSession stopped")
    }
  }
}
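Expected output:
+-----+---+---------+-----------+
|city |ip |source   |createdDate|
+-----+---+---------+-----------+
|delhi|   |PlayStore|2020-04-21 |
|     |   |PlayStore|2020-04-22 |
+-----+---+---------+-----------+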
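
If the maps can have more than four keys, missing keys, or values that contain commas, joining and splitting on "," becomes fragile. A possible alternative is to look up each column by key and build the DataFrame from Row objects with an explicit schema. The following is only a sketch: the object name MapRddToDf and the helper mapsToDf are hypothetical, every column is assumed to be a nullable string, and keys missing from a map are assumed to become null.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object MapRddToDf {
  // Hypothetical helper: one nullable string column per requested key.
  // Keys absent from a map become null in that row (an assumption of this sketch).
  def mapsToDf(spark: SparkSession,
               rdd: RDD[Map[String, String]],
               columns: Seq[String]): DataFrame = {
    val schema = StructType(columns.map(c => StructField(c, StringType, nullable = true)))
    // Look values up by key, so map ordering and commas inside values do not matter
    val rows = rdd.map(m => Row.fromSeq(columns.map(c => m.getOrElse(c, null))))
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MapRddToDf")
      .master("local[*]") // assumption: local run
      .getOrCreate()
    try {
      val inputRdd = spark.sparkContext.parallelize(Seq(
        Map("city" -> "delhi", "ip" -> "", "source" -> "PlayStore", "createdDate" -> "2020-04-21"),
        Map("city" -> "", "ip" -> "", "source" -> "PlayStore", "createdDate" -> "2020-04-22")))

      // Derive the column names from the data itself; sorted for a stable column order
      val columns = inputRdd.flatMap(_.keys).distinct().collect().sorted.toSeq

      val df = mapsToDf(spark, inputRdd, columns)
      df.show(truncate = false)
    } finally {
      spark.stop()
    }
  }
}
```

The resulting DataFrame can then be saved with df.write.parquet(...) in the same way as above. Collecting the distinct keys triggers an extra pass over the RDD, so if the column names are known in advance it is cheaper to pass them in directly.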