I am trying to save a DataFrame as an Avro file. I read in an XML file that has many nested layers, and it loads into a DataFrame successfully. The XML has many namespace headers such as @ns0, @ns1, @ns2, etc. These become column names in the DataFrame.
When I try to save it as an Avro file, I get this error:
Exception in thread "main" org.apache.avro.SchemaParseException: Illegal initial character: @ns0
Code:
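import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("conversion")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Read the XML; each <Stuff> element becomes one row
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Stuff")
  .load("sample.xml")
df.printSchema()
df.show()

// Fails here: attribute columns such as @ns0 are not valid Avro names
df.write
  .format("com.databricks.spark.avro")
  .save("output")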
A valid Avro name has to start with a letter or an underscore, so you have to either rename the columns generated from attributes or specify an alternative prefix. spark-xml lets you configure the attribute prefix with the attributePrefix option:
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Stuff")
  .option("attributePrefix", "attr_") // or some other prefix of your choice
  .load("sample.xml")
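If you would rather take the other route and rename the columns, a minimal sketch could look like the following. AvroNames and sanitize are hypothetical names made up for this example, not part of spark-xml or spark-avro, and the rule it applies (first character a letter or underscore, the rest letters, digits, or underscores) is the Avro naming rule as I understand it:

```scala
// Hypothetical helper (not part of spark-xml or spark-avro):
// turns an arbitrary column name into one Avro should accept.
object AvroNames {
  def sanitize(name: String): String = {
    // Replace every character outside [A-Za-z0-9_] with an underscore
    val cleaned = name.replaceAll("[^A-Za-z0-9_]", "_")
    // A name must not be empty and must not start with a digit
    if (cleaned.isEmpty || cleaned.head.isDigit) "_" + cleaned else cleaned
  }
}
```

With this, AvroNames.sanitize("@ns0") returns "_ns0". The top-level columns could then be renamed with df.toDF(df.columns.map(AvroNames.sanitize): _*) before writing. Note that this only touches top-level column names; attributes inside nested structs would still carry the original prefix, so for deeply nested XML the attributePrefix option is probably the simpler fix.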