I am currently working on a project where we have data stored in Azure Data Lake. The Data Lake is hooked up to Azure Databricks.
The requirement is that Azure Databricks be connected to a C# application so that queries can be run and results retrieved entirely from the C# application. The way we are currently tackling the problem is that we have created a workspace on Databricks with a number of queries that need to be executed. We created a job that is linked to the mentioned workspace. From the C# application we call a number of the APIs listed in this documentation to trigger an instance of the job and wait for it to be executed (roughly like the sketch below). However, I have not been able to extract the result from any of the APIs listed in the documentation.
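For reference, this is roughly what our calling code looks like (a minimal sketch against the Jobs API 2.1; the workspace URL, token variable and job id are placeholders):

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

class JobRunner
{
    static readonly HttpClient Http = new HttpClient();

    static async Task Main()
    {
        const string workspace = "https://adb-1234567890123456.7.azuredatabricks.net"; // placeholder
        Http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
            "Bearer", Environment.GetEnvironmentVariable("DATABRICKS_TOKEN"));

        // Trigger an instance of the job.
        var trigger = await Http.PostAsync($"{workspace}/api/2.1/jobs/run-now",
            new StringContent("{\"job_id\": 123}", Encoding.UTF8, "application/json"));
        trigger.EnsureSuccessStatusCode();
        long runId = JsonDocument.Parse(await trigger.Content.ReadAsStringAsync())
            .RootElement.GetProperty("run_id").GetInt64();

        // Poll until the run reaches a terminal state.
        string state;
        do
        {
            await Task.Delay(TimeSpan.FromSeconds(10));
            var status = await Http.GetAsync($"{workspace}/api/2.1/jobs/runs/get?run_id={runId}");
            status.EnsureSuccessStatusCode();
            state = JsonDocument.Parse(await status.Content.ReadAsStringAsync())
                .RootElement.GetProperty("state").GetProperty("life_cycle_state").GetString();
        } while (state == "PENDING" || state == "RUNNING" || state == "TERMINATING");

        Console.WriteLine($"Run {runId} finished with life cycle state {state}");
    }
}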
My question is this: are we taking the correct approach, or is there something we are not seeing? If this is the way to go, what has been your experience extracting the result of a successfully run job on Azure Databricks from a C# application?
Microsoft has a nice reference architecture that might give you some more insight too.
I'm not sure the REST API is the best way to get your job output from Azure Databricks.
First of all, the REST API has a rate limit per Databricks instance. At 30 requests per second it's not that bad, but whether it is sufficient depends strongly on the scale of your application and on other uses of the Databricks instance. It should be enough for creating a job, but if you want to poll the job status for completion it might not be.
Data transfer via the REST API is also limited. For example, as per the docs, the output API only returns the first 5 MB of a run's output. If you want larger results you'll have to store them somewhere else before getting them from your C# application.
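If your result is small, fetching it via jobs/runs/get-output could look roughly like the sketch below (continuing from the sketch in the question; note that notebook_output is only populated for notebook tasks that call dbutils.notebook.exit):

using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

static class RunOutput
{
    // 'http' is assumed to already carry the Bearer token, as in the question's sketch.
    public static async Task PrintAsync(HttpClient http, string workspace, long runId)
    {
        var response = await http.GetAsync($"{workspace}/api/2.1/jobs/runs/get-output?run_id={runId}");
        response.EnsureSuccessStatusCode();

        var root = JsonDocument.Parse(await response.Content.ReadAsStringAsync()).RootElement;
        if (root.TryGetProperty("notebook_output", out var output))
        {
            // 'result' holds whatever the notebook passed to dbutils.notebook.exit().
            Console.WriteLine(output.GetProperty("result").GetString());

            // Anything beyond the first 5 MB is cut off; 'truncated' flags that.
            if (output.TryGetProperty("truncated", out var truncated) && truncated.GetBoolean())
                Console.WriteLine("Run output exceeded 5 MB and was truncated.");
        }
    }
}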
Alternative retrieval method
In short: use Azure PaaS to your advantage with Blob Storage and Event Grid.
This is in no way an exhaustive solution, and I'm sure someone can come up with a better one, but this has worked for me in similar use cases.
What you can do is write the results from your job runs to some form of cloud storage connected to Databricks, and then fetch them from that storage location later. There is a step in this tutorial that shows the basic concept of storing the results of a job with SQL Data Warehouse, but you can use any storage you like, for example Blob Storage.
Let's say you store the result in Blob Storage. Each time a new job output is written to a blob, you can raise an event. You can subscribe to these events via Azure Event Grid and consume them in your application. There is a .NET SDK that lets you do this. The event will contain a blob URI that you can use to get the data into your application.
From the docs, a BlobCreated event will look something like this:
[{
  "topic": "/subscriptions/{subscription-id}/resourceGroups/Storage/providers/Microsoft.Storage/storageAccounts/my-storage-account",
  "subject": "/blobServices/default/containers/test-container/blobs/new-file.txt",
  "eventType": "Microsoft.Storage.BlobCreated",
  "eventTime": "2017-06-26T18:41:00.9584103Z",
  "id": "831e1650-001e-001b-66ab-eeb76e069631",
  "data": {
    "api": "PutBlockList",
    "clientRequestId": "6d79dbfb-0e37-4fc4-981f-442c9ca65760",
    "requestId": "831e1650-001e-001b-66ab-eeb76e000000",
    "eTag": "\"0x8D4BCC2E4835CD0\"",
    "contentType": "text/plain",
    "contentLength": 524288,
    "blobType": "BlockBlob",
    "url": "https://my-storage-account.blob.core.windows.net/testcontainer/new-file.txt",
    "sequencer": "00000000000004420000000000028963",
    "storageDiagnostics": {
      "batchId": "b68529f3-68cd-4744-baa4-3c0498ec19f0"
    }
  },
  "dataVersion": "",
  "metadataVersion": "1"
}]
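Once such an event reaches your endpoint, parsing it with the Azure.Messaging.EventGrid package could look roughly like this (a sketch; it skips the Event Grid subscription-validation handshake, and requestBody stands in for the raw payload your webhook or Function received):

using System;
using Azure.Messaging.EventGrid;
using Azure.Messaging.EventGrid.SystemEvents;

static class JobOutputEvents
{
    // 'requestBody' is the raw JSON payload delivered to your endpoint.
    public static void HandleEvents(string requestBody)
    {
        foreach (EventGridEvent egEvent in EventGridEvent.ParseMany(BinaryData.FromString(requestBody)))
        {
            if (egEvent.TryGetSystemEventData(out object systemEvent) &&
                systemEvent is StorageBlobCreatedEventData blobCreated)
            {
                // The event carries the blob url; hand it to the storage SDK to fetch the data.
                Console.WriteLine($"New job output at {blobCreated.Url}");
            }
        }
    }
}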
It will be important to name your blobs with the required information, such as the job ID and run ID. You can also create custom events, which will increase the complexity of the solution but will allow you to add more details to your events.
Once you have the BlobCreated event data in your app, you can use the Storage SDK to get the blob data for use in your application. Depending on your application logic, you'll also have to manage the job and run IDs in the application; otherwise you run the risk of having job output in your storage that is no longer attached to a process in your app.
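A sketch of that last step with Azure.Storage.Blobs (plus Azure.Identity for auth; the "{jobId}/{runId}.json" naming convention is only an illustration of the point above, not a fixed scheme):

using System;
using System.IO;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Storage.Blobs;

static class JobOutputReader
{
    // 'blobUrl' comes from the BlobCreated event shown above.
    public static async Task<string> ReadOutputAsync(string blobUrl)
    {
        var blobClient = new BlobClient(new Uri(blobUrl), new DefaultAzureCredential());

        // With a naming convention like "{jobId}/{runId}.json" the ids can be
        // recovered from the blob path and matched to a pending run in your app.
        string[] parts = blobClient.Name.Split('/');
        string jobId = parts[0];
        string runId = Path.GetFileNameWithoutExtension(parts[1]);
        Console.WriteLine($"Output for job {jobId}, run {runId}");

        // Download the blob content as text.
        var download = await blobClient.DownloadContentAsync();
        return download.Value.Content.ToString();
    }
}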