I'm working on a system that will act as an OLAP engine for a simulation toolchain dataset. The tools generate their results in XML.
The simplest solution for me would have been to use spark-xml to read the XML files directly from Python, Scala, etc. The problem is that the project owners want to use C#, since that is what the original simulation toolchain is built in. I know there is SparkCLR for C#, but I don't know of a good way to use spark-xml from C#.
Does anyone have any suggestions on how to do this? If not, I guess the next option would be to translate the datasets into something more native to SparkCLR, but I'm not sure of the best approach.
SparkCLR works with spark-xml. The following code shows how to use C# to process XML as a Spark DataFrame, and you can use it as a starting point for your own XML processing application in C#. The sample implements the same example that is available at https://github.com/databricks/spark-xml#scala-api. Note that you need to include the spark-xml jar when submitting your job.
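// SparkCLR namespaces for SparkConf/SparkContext and SqlContext.
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Sql;

// Set up the Spark and SQL contexts.
var sparkConf = new SparkConf();
var sparkContext = new SparkContext(sparkConf);
var sqlContext = new SqlContext(sparkContext);

// Read books.xml as a DataFrame, one row per <book> element.
var df = sqlContext.Read()
    .Format("com.databricks.spark.xml")
    .Option("rowTag", "book")
    .Load(@"D:\temp\spark-xml\books.xml");

// Keep the author element and the id attribute (addressed here as "@id").
var selectedData = df.Select("author", "@id");

// Write the selection back out as XML, wrapped in a <books> root element.
selectedData.Write()
    .Format("com.databricks.spark.xml")
    .Option("rootTag", "books")
    .Option("rowTag", "book")
    .Save(@"D:\temp\spark-xml\newbooks.xml");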
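As for including the spark-xml jar at submit time, here is a rough sketch of what the command might look like. It assumes you launch the job with SparkCLR's sparkclr-submit.cmd script and pass the jar through the standard spark-submit --jars option. The jar path, the driver name SparkClrXml.exe, and the application directory are placeholders I made up, so substitute your own values and the spark-xml build that matches your Spark and Scala versions.

```
REM Hypothetical paths and names; adjust to your environment.
sparkclr-submit.cmd --jars <path-to>\spark-xml_<scala-version>-<version>.jar --exe SparkClrXml.exe <path-to-app-directory>
```

If the jar is missing, the com.databricks.spark.xml format will not be found when the job calls Load or Save.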