
Load Data Using Azure Batch Service and Spark Databricks


I have a file in Azure Blob Storage that I need to load into the Data Lake daily. I am not clear on which approach I should use (an Azure Batch account with a Custom activity, Databricks, or a Copy activity). Please advise me.


Solution

  • To load files from Blob Storage to the data lake, we can use Data Factory pipelines. Since the requirement is to do the copy every day, we must schedule a trigger.

    A schedule trigger runs the pipeline periodically, starting at a selected time. Each run copies the file or directory again and replaces the previous copy at the destination, so any changes made to that file in Blob Storage on a given day will be reflected in the data lake after the next scheduled run.

    You can also use a Databricks notebook in the pipeline to do the same. The notebook contains the copy logic and is run every time the pipeline is triggered; a minimal sketch of such a notebook is shown below.
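    For reference, here is a minimal sketch of what the copy logic in such a notebook could look like. The storage account, container, path, and secret names are placeholders, and it assumes the destination is an ADLS Gen2 account reachable with an account key stored in a Databricks secret scope.

```python
# Minimal Databricks notebook sketch: copy a file from Blob Storage to the data lake.
# All account, container, path, and secret names below are placeholders.

blob_account = "sourceblobaccount"   # hypothetical source storage account
lake_account = "destlakeaccount"     # hypothetical ADLS Gen2 destination account
src_path = f"wasbs://input@{blob_account}.blob.core.windows.net/daily/data.csv"
dst_path = f"abfss://raw@{lake_account}.dfs.core.windows.net/daily/data.csv"

# Authenticate with account keys pulled from a Databricks secret scope (assumed to exist).
spark.conf.set(
    f"fs.azure.account.key.{blob_account}.blob.core.windows.net",
    dbutils.secrets.get(scope="storage", key="blob-account-key"),
)
spark.conf.set(
    f"fs.azure.account.key.{lake_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="storage", key="lake-account-key"),
)

# Straight file copy; each scheduled run replaces the previous day's copy.
dbutils.fs.cp(src_path, dst_path)

# Alternatively, read and rewrite with Spark if any transformation is needed:
# df = spark.read.option("header", "true").csv(src_path)
# df.write.mode("overwrite").parquet(dst_path.replace(".csv", ""))
```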

    You can follow these steps to perform the copy:

    • Open Data Factory Studio and select the “Author” tab. There you can see the Pipelines section, where you can create a new pipeline.

    • Give the pipeline an appropriate name under the Properties tab. You can see the different activities from which you can build the pipeline; according to your requirement, select either Copy data from the Move & transform section or Notebook from the Databricks section. (If you prefer to script these steps rather than use the UI, see the sketch after this list.)

    • Create the necessary linked services (source and sink for copy activity, Databricks linked service for notebook).

    • After providing all the information, validate the pipeline to check for errors and publish it. Then add a new schedule trigger from the Add trigger option (note that Trigger now executes the pipeline only once). Specify the start date, time zone, and recurrence details for the trigger.


    • The trigger will then run the pipeline periodically, starting from the start time you specified.
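    As mentioned above, if you prefer to script these steps rather than click through the studio, the sketch below shows roughly how an equivalent Copy-activity pipeline could be defined with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, and dataset names are placeholders, and the referenced datasets and linked services are assumed to already exist.

```python
# Hedged sketch: define a Copy-activity pipeline with the azure-mgmt-datafactory SDK.
# Subscription, resource group, factory, and dataset names are placeholders; the
# datasets and linked services referenced here are assumed to exist already.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
)

subscription_id = "<subscription-id>"
rg_name = "my-resource-group"        # placeholder resource group
df_name = "my-data-factory"          # placeholder data factory

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

copy_activity = CopyActivity(
    name="CopyBlobToDataLake",
    inputs=[DatasetReference(reference_name="BlobSourceDataset")],
    outputs=[DatasetReference(reference_name="DataLakeSinkDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(
    rg_name, df_name, "DailyBlobToLakePipeline", pipeline
)
```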

    The key point is that you must schedule a trigger no matter which method you use, so that the pipeline recurs as per your requirement (every 24 hours in your case).
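    As a rough illustration, such a daily schedule trigger can also be created programmatically. The sketch below reuses the client and placeholder names from the pipeline sketch above; model and method names may vary slightly between azure-mgmt-datafactory versions.

```python
# Hedged sketch: attach a daily schedule trigger to the pipeline defined above.
from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import (
    ScheduleTrigger, ScheduleTriggerRecurrence, TriggerResource,
    TriggerPipelineReference, PipelineReference,
)

recurrence = ScheduleTriggerRecurrence(
    frequency="Day",                                       # recur every 24 hours
    interval=1,
    start_time=datetime.utcnow() + timedelta(minutes=15),  # first run shortly after creation
    time_zone="UTC",
)

trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="DailyBlobToLakePipeline"),
            parameters={},
        )
    ],
)

adf_client.triggers.create_or_update(
    rg_name, df_name, "DailyTrigger", TriggerResource(properties=trigger)
)

# The trigger must be started before it fires on the schedule
# (older SDK versions expose start() instead of begin_start()).
adf_client.triggers.begin_start(rg_name, df_name, "DailyTrigger").result()
```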

    You can refer to the following docs: