Tags: database, azure-databricks, data-storage, azure-data-lake-gen2

What is the best way to archive structured data in low-cost storage that can be accessed through an API?


I have a few hundred GB of data in a MySQL database that I would like to archive in low-cost cloud storage. At the same time, I would still like the data to be easy to access and retrieve through an API. I am looking for a technology-independent solution (not tied to any particular database engine).

Retrieval is expected to be very infrequent, but I would like to keep the "cost" of retrieval (that is, the effort) low as well. I can live with higher latency when retrieving the data.

I was thinking of using Spark/Databricks (to provide the structured API) together with a long-term storage solution such as Azure Data Lake Storage Gen2 (to provide the low-cost storage).

Is this the optimal solution or are there better alternatives?


Solution

  • Yes, Azure Data Lake Storage Gen2 (ADLS Gen2) is a good fit for archiving the data at low cost.

    Check the detailed pricing here.

    To archive the data in ADLS Gen2, select the Cool access tier on the Advanced tab when you create the storage account. Also tick the Enable hierarchical namespace option.

    Advanced -> Access tier -> Cool

    You can authenticate to and access an ADLS Gen2 storage account using an Azure storage account access key. Apache Spark provides multiple APIs (RDD, DataFrame, Dataset) that let you access and transform the data as required.

    To learn more about these APIs, refer to: Databricks with ADLS Gen2; RDDs vs DataFrames and Datasets.

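    Note: Data in the Cool tier stays online and can be read directly, although read operations cost more than in the Hot tier. Only data moved to the Archive tier has to be rehydrated to Hot or Cool before it can be read.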
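
The access-key approach described above could look roughly like the following in a Databricks notebook. This is only a sketch under assumptions: it presumes the MySQL tables were already exported to Parquet files in the lake, and the storage account, container, and path names used here are hypothetical placeholders, as are the helper function names.

```python
def adls_conf_key(storage_account: str) -> str:
    """Spark/Hadoop config key that holds the account access key for ABFS."""
    return f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"


def abfss_uri(container: str, storage_account: str, path: str) -> str:
    """Build an abfss:// URI for a path inside an ADLS Gen2 container."""
    return (
        f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
        f"{path.lstrip('/')}"
    )


def read_archived_table(spark, storage_account, container, path, access_key):
    """Register the access key for this session and load the Parquet data.

    `spark` is the SparkSession (predefined in Databricks notebooks).
    Assumes the archived data was written as Parquet; use spark.read.csv
    or another reader if a different format was chosen.
    """
    spark.conf.set(adls_conf_key(storage_account), access_key)
    return spark.read.parquet(abfss_uri(container, storage_account, path))


# Example usage inside a notebook (all names below are hypothetical):
# key = dbutils.secrets.get(scope="archive-scope", key="adls-access-key")
# df = read_archived_table(spark, "mystorageacct", "archive", "mysql/orders", key)
# df.filter(df.order_date >= "2015-01-01").show()
```

Keeping the access key in a Databricks secret scope rather than in the notebook itself avoids exposing it in source control or in notebook revisions.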