Search code examples
azure-data-factoryazure-data-lake

Getting data from REST API and storing it in Azure Data Lake


Does it make sense to get data from REST API and store it as JSON in an Azure Data Lake? Or the data should be stored directly into Azure SQL?

I've tried both options, but it's not clear in which scenario it is worth to save the data into Azure Data Lake.


Solution

  • Yes this is a perfectly normal pattern that has emerged for collecting large volumes in particular. Writing to a database is great but there are (at least) two aspects to consider:

    1. schema-on-write - you have to know the schema before you write to the database. That means all columns, all datatypes, nullability, collation even before you can even think about writing a record. How are you going to handle the schema of your JSON changing for example?
    2. transaction logging - most Microsoft SQL databases work with write-ahead-log or WAL, which means the transaction logging has to complete before the transaction is considered complete as part of an ACID transaction. What will happen in situations of heavy load on the database or high concurrency - queuing and blocking. Often these things take milliseconds but low tiers etc come into play. Alternate patterns like eventual consistency eg with Cosmos are a possibility if you need that sort of thing.

    Data Lakes in contract are schema-on-read, ie you do not have to know the schema in order to write to the lake, so you can just land it and figure out the other stuff later.

    This does not necessarily apply to your other question about Synapse as you run the risk of losing your perfectly good SQL Server datatypes. Look at one of the migration wizards for that instead.