
Filtering dataframe in spark and saving as avro


I am trying to save a dataframe as an Avro file. I have read in an XML file with many nested layers, and it is loaded into a dataframe successfully. The XML has many namespace attributes such as @ns0, @ns1, @ns2, etc., and these become the column headers in the dataframe.

When I try to save it as an Avro file, it gives me this error:

Exception in thread "main" org.apache.avro.SchemaParseException: Illegal initial character: @ns0

Code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("conversion")
val sc = new SparkContext(conf)

val sqlContext = new SQLContext(sc)

// Read the XML, treating each <Stuff> element as a row
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Stuff")
  .load("sample.xml")

df.printSchema()
df.show()

// Write the dataframe as Avro -- this is where the exception is thrown
df.write
  .format("com.databricks.spark.avro")
  .save("output")

Solution

  • A valid Avro name has to start with a letter or an underscore, so you have to either rename the columns generated from attributes or specify an alternative prefix. spark-xml lets you configure the attribute prefix using the attributePrefix option (a sketch of the rename alternative follows the example below):

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "Stuff")
      .option("attributePrefix", "attr_")  // or some other prefix of your choice
      .load("sample.xml")