scala, apache-spark, apache-spark-sql, avro, parquet

Spark Avro to Parquet


I have a stream of Avro-formatted data (JSON encoded) that needs to be stored as Parquet files. So far I have only been able to do this:

val df = sqc.read.json(jsonRDD).toDF()

and then write the df as Parquet.
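
For reference, the write step looks roughly like this (the output path is just a placeholder):

df.write.parquet("output/path")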

Here the schema is inferred from the JSON. But I already have the .avsc file, and I don't want Spark to infer the schema from the JSON.

Also, written this way, the Parquet files store the schema information as a StructType and not as avro.record.type. Is there a way to store the Avro schema information as well?

Spark 1.4.1


Solution

  • Ended up using the answer to this question: avro-schema-to-spark-structtype

    import java.io.File
    import org.apache.avro.Schema
    import org.apache.avro.file.DataFileWriter
    import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types.StructType

    // Write an empty Avro file with the given schema, then let spark-avro
    // read it back and report the equivalent Spark StructType.
    def getSparkSchemaForAvro(sqc: SQLContext, avroSchema: Schema): StructType = {
        val dummyFile = File.createTempFile("avro_dummy", "avro")
        val datumWriter = new GenericDatumWriter[GenericRecord](avroSchema)
        val writer = new DataFileWriter[GenericRecord](datumWriter).create(avroSchema, dummyFile)
        writer.flush()
        writer.close()
        val df = sqc.read.format("com.databricks.spark.avro").load(dummyFile.getAbsolutePath)
        df.schema
    }
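
    A rough usage sketch (the "user.avsc" path and output path below are my own placeholders, not part of the original answer):

    // hypothetical: "user.avsc" stands in for your actual Avro schema file
    val avroSchema = new Schema.Parser().parse(new File("user.avsc"))
    val sparkSchema = getSparkSchemaForAvro(sqc, avroSchema)
    // apply the schema explicitly instead of letting Spark infer it from the JSON
    val df = sqc.read.schema(sparkSchema).json(jsonRDD)
    df.write.parquet("output/path")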