Tags: c#, apache-spark, azure-hdinsight, livy, spark-dotnet

Submit a Spark job from C# and get results


As the title says, I would like to submit a computation to a Spark cluster (local, or HDInsight in Azure) and get the results back in a C# application.

I am aware of Livy, which I understand is a REST API service that sits on top of Spark and accepts queries, but I have not found a standard C# client package for it. Is Livy the right tool for the job? Is it just missing a well-known C# client?

The Spark cluster needs to access Azure Cosmos DB, so I need to be able to submit a job that includes the connector jar (or its path on the cluster driver) so that Spark can read data from Cosmos DB.


Solution

  • Since a .NET Spark connector for querying data did not seem to exist, I wrote one:

    https://github.com/UnoSD/SparkSharp

    It is only a quick implementation, but it also includes a way of querying Cosmos DB using Spark SQL.

    It is just a C# client for Livy, but it should be more than enough.

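    // Assumes this runs inside an async method; `config` (session settings) and
    // `Result` (a type matching the query's columns) are defined elsewhere.
    using (var client = new HdInsightClient("clusterName", "admin", "password"))
    using (var session = await client.CreateSessionAsync(config))
    {
        // Run a Scala snippet on the cluster and read the printed value back as an int.
        var sum = await session.ExecuteStatementAsync<int>("val res = 1 + 1\nprintln(res)");

        const string sql = "SELECT id, SUM(json.total) AS total FROM cosmos GROUP BY id";

        // Run a Spark SQL query against a Cosmos DB collection and deserialize the rows.
        var cosmos = await session.ExecuteCosmosDbSparkSqlQueryAsync<IEnumerable<Result>>
        (
            "cosmosName",
            "cosmosKey",
            "cosmosDatabase",
            "cosmosCollection",
            "cosmosPreferredRegions",
            sql
        );
    }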