Tags: azure, azure-blob-storage, data-warehouse, azure-data-lake

Data Lake Blob Storage


I'm after a bit of understanding. I'm not stuck on anything, but I'm trying to understand something better.

When loading a data warehouse, why is it always suggested that we load data into blob storage or a data lake first? I understand that it's very quick to pull data from there, but in my experience there are a couple of pitfalls.

The first is that there is a file size limit: if you load too much data into one file, as I've seen happen, the load errors out, at which point we have to switch the load to incremental. That brings me to my second issue. I always thought the point of loading into blob storage was to put all the data in there so you can access it in the future without stressing the front-end systems. If I can't do that because of file limits, what's the point of even using blob storage? We might as well load data straight into staging tables. It just seems like an unnecessary step to me; I've run data warehouses in the past without this part involved, and to me they have worked better.

Anyway, my understanding of this part is not as good as I'd like it to be, and I've tried finding articles that answer these specific questions, but none have really explained the concept clearly to me. Any help or links to good articles I could read would be much appreciated.


Solution

  • One reason for placing the data in blob storage or a data lake is so that multiple parallel readers can work on the data at the same time. The goal is to read the data in a reasonable amount of time. Not all data sources support that kind of read operation, and given the size of your file, a single reader would take a very long time.

    One such example is SFTP. Not all SFTP servers support offset (range) reads, and some impose further restrictions on concurrent connections. Moving the data into Azure services first gives you a known set of capabilities and limitations.

    In your case, I think what you need is to partition the file, much like HDFS does. If I knew which data source you are using, I could offer a more specific suggestion; a rough sketch of both ideas is below.
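
    To make the two points above concrete, here is a minimal sketch assuming the azure-storage-blob v12 Python SDK and a pandas DataFrame as the extracted source data. The connection string, container name, blob names, chunk sizes, and function names are placeholders I've made up for illustration, not part of any prescribed pattern. The first function lands the extract as many smaller partition files (avoiding the single-file size limit), and the second reads one blob with several concurrent offset/length requests, which is the parallel-read capability a source like SFTP may not give you.

    ```python
    from concurrent.futures import ThreadPoolExecutor

    import pandas as pd
    from azure.storage.blob import BlobServiceClient

    # Placeholders -- substitute your own storage account and container.
    CONN_STR = "<storage-account-connection-string>"
    CONTAINER = "raw-landing"

    service = BlobServiceClient.from_connection_string(CONN_STR)


    def upload_partitioned(df: pd.DataFrame, rows_per_file: int = 1_000_000) -> None:
        """Land the extract as many smaller blobs instead of one oversized file,
        which is how you avoid hitting a single-file size limit."""
        for i, start in enumerate(range(0, len(df), rows_per_file)):
            chunk = df.iloc[start:start + rows_per_file]
            blob = service.get_blob_client(
                container=CONTAINER, blob=f"sales/part-{i:05d}.csv"
            )
            blob.upload_blob(chunk.to_csv(index=False), overwrite=True)


    def parallel_range_read(blob_name: str, chunk_bytes: int = 8 * 1024 * 1024) -> bytes:
        """Read one blob with several concurrent offset/length (range) requests --
        the kind of parallel read a source such as SFTP may not support."""
        blob = service.get_blob_client(container=CONTAINER, blob=blob_name)
        size = blob.get_blob_properties().size

        def read_range(offset: int) -> bytes:
            # Clamp the last range so we never ask for bytes past the end of the blob.
            length = min(chunk_bytes, size - offset)
            return blob.download_blob(offset=offset, length=length).readall()

        with ThreadPoolExecutor(max_workers=8) as pool:
            parts = list(pool.map(read_range, range(0, size, chunk_bytes)))
        return b"".join(parts)
    ```

    The same idea applies whatever tool actually does the warehouse load: many small files in the lake let the loader fan out over them, whereas one huge file forces a single slow reader and risks hitting size limits.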