Processing XML files with Spark and C#

I'm working on a system that will act as an OLAP engine for a simulation toolchain dataset. The tools generate their results in XML.

The simplest solution for me would have been to use spark-xml to access the XML files directly from Python, Scala, etc. The problem is that the project owners want to use C#, since that is what the original simulation toolchain is built in. I know there is SparkCLR for C#, but I don't know of a good way to use spark-xml from C#.

Does anyone have suggestions on how to do this? If not, I guess the next option would be to translate the datasets into something more native for SparkCLR, but I'm not sure of the best approach.

Solution

SparkCLR works with spark-xml. The following code shows how to use C# to process XML as a Spark DataFrame, and you can use it as a starting point for your own XML-processing C# application for Spark. It implements the same example as https://github.com/databricks/spark-xml#scala-api. Note that you need to include the spark-xml JAR when submitting your job (for example via the `--jars` option), since the XML data source is not part of Spark itself.

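        // Requires: using Microsoft.Spark.CSharp.Core; using Microsoft.Spark.CSharp.Sql;
        var sparkConf = new SparkConf();
        var sparkContext = new SparkContext(sparkConf);
        var sqlContext = new SqlContext(sparkContext);

        // Read books.xml; each <book> element becomes one DataFrame row.
        var df = sqlContext.Read()
            .Format("com.databricks.spark.xml")
            .Option("rowTag", "book")
            .Load(@"D:\temp\spark-xml\books.xml");

        // Keep the <author> element and the id attribute, exposed here as "@id"
        // (the prefix is governed by spark-xml's attributePrefix option).
        var selectedData = df.Select("author", "@id");

        // Write the selection back out as XML: a <books> root with one <book> per row.
        // Spark creates a directory of part files at the target path.
        selectedData.Write()
            .Format("com.databricks.spark.xml")
            .Option("rootTag", "books")
            .Option("rowTag", "book")
            .Save(@"D:\temp\spark-xml\newbooks.xml");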
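
If you would rather go with the fallback mentioned in the question, translating the XML into something Spark reads natively, one option is JSON Lines (one JSON object per line), which Spark's built-in JSON data source understands. The following is only a minimal sketch under a few assumptions: the class and method names (`XmlRows`, `ToJsonLines`) are made up for illustration, it uses `System.Text.Json` (which needs a recent .NET or the corresponding NuGet package; Json.NET would do equally well), and it assumes flat records where each row element has only attributes and simple child elements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using System.Xml.Linq;

// Hypothetical helper, not part of SparkCLR or spark-xml.
public static class XmlRows
{
    // Turns every <rowTag> element in the XML text into one JSON object string.
    // Attributes are stored under "@name", child elements under their own name.
    // Assumption: records are flat. Nested children collapse to their text
    // content, and a repeated child name keeps only its last value.
    public static IEnumerable<string> ToJsonLines(string xml, string rowTag)
    {
        var doc = XDocument.Parse(xml);
        foreach (var row in doc.Descendants(rowTag))
        {
            var record = new Dictionary<string, string>();
            foreach (var attr in row.Attributes())
            {
                record["@" + attr.Name.LocalName] = attr.Value;
            }
            foreach (var child in row.Elements())
            {
                record[child.Name.LocalName] = child.Value;
            }
            yield return JsonSerializer.Serialize(record);
        }
    }
}
```

You would write the resulting lines to a file (for instance with `File.WriteAllLines`) and then load that file with Spark's JSON reader instead of spark-xml. For large result sets, parsing the whole document with `XDocument.Parse` may use too much memory, in which case a streaming `XmlReader` would be the more suitable tool.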