c# xml apache-spark mobius

Processing XML files with Spark and C#


I'm working on a system that will act as an OLAP engine for a simulation toolchain dataset. The tools generate their results in XML.

The simplest solution for me would have been to use spark-xml to access the XML files directly from Python, Scala, etc. The problem is that the project owners want to use C#, since that is what the original simulation toolchain is built in. I know there is SparkCLR for C#, but I don't know of a good way to use spark-xml from C#.

Does anyone have suggestions on how to do this? If not, I guess the next option would be to translate the datasets into something more native to SparkCLR, but I'm not sure of the best approach.


Solution

  • SparkCLR (now called Mobius) works with spark-xml. The following C# code processes XML as a Spark DataFrame, and you can use it as a starting point for your own XML-processing C# application for Spark. It implements the same example as the Scala one at https://github.com/databricks/spark-xml#scala-api. Note that you need to include the spark-xml jar when submitting your job.

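            using Microsoft.Spark.CSharp.Core;   // SparkConf, SparkContext
            using Microsoft.Spark.CSharp.Sql;    // SqlContext, DataFrame

            var sparkConf = new SparkConf();
            var sparkContext = new SparkContext(sparkConf);
            var sqlContext = new SqlContext(sparkContext);

            // Read the XML file; each <book> element becomes one DataFrame row.
            var df = sqlContext.Read()
                .Format("com.databricks.spark.xml")
                .Option("rowTag", "book")
                .Load(@"D:\temp\spark-xml\books.xml");

            // "@id" is the id attribute of <book>. The attribute prefix depends on
            // the spark-xml version and its attributePrefix option, so check yours.
            var selectedData = df.Select("author", "@id");

            // Write the selection back as XML: <books> is the root element and
            // each row becomes a <book>. Spark writes a directory at this path.
            selectedData.Write()
                .Format("com.databricks.spark.xml")
                .Option("rootTag", "books")
                .Option("rowTag", "book")
                .Save(@"D:\temp\spark-xml\newbooks.xml");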
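
If spark-xml cannot be used for some reason, the fallback mentioned in the question (translating the datasets into a more native format) could look roughly like the sketch below. It is only an illustration, not part of SparkCLR or spark-xml: the class `XmlToJsonLines` and the method `ToJsonLines` are hypothetical names, and it assumes flat rows (attributes plus simple child elements), no XML namespaces on the row tag, and that `System.Text.Json` is available in your .NET environment. Each row element becomes one JSON object per line, which Spark's built-in JSON data source can read.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using System.Xml.Linq;

// Hypothetical helper, not part of SparkCLR/Mobius or spark-xml.
public static class XmlToJsonLines
{
    // Turns every <rowTag> element into one JSON object (one string per row).
    // Attributes are stored under "@name" keys and child elements under "name"
    // keys, loosely mirroring the "@id" column in the spark-xml sample above.
    // Assumption: rows are flat; nested elements are collapsed to their text.
    public static IEnumerable<string> ToJsonLines(XDocument doc, string rowTag)
    {
        foreach (var row in doc.Descendants(rowTag))
        {
            var record = new Dictionary<string, string>();

            // Skip xmlns declarations, which LINQ to XML exposes as attributes.
            foreach (var attr in row.Attributes().Where(a => !a.IsNamespaceDeclaration))
            {
                record["@" + attr.Name.LocalName] = attr.Value;
            }

            foreach (var child in row.Elements())
            {
                record[child.Name.LocalName] = child.Value;
            }

            yield return JsonSerializer.Serialize(record);
        }
    }
}
```

A caller would load a file with `XDocument.Load(path)`, pass it to `XmlToJsonLines.ToJsonLines(doc, "book")`, and write the resulting lines to a text file with `File.WriteAllLines`. For large simulation outputs a streaming reader such as `XmlReader` would likely be a better fit than loading whole documents into memory.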