Tags: azure, azure-data-lake, u-sql

Incremental load in Azure Data Lake


I have a large blob storage account full of log files that are organized by their identifiers at several levels: repository, branch, build number, build step number.

These are JSON files containing an array of objects; each object has a timestamp and an entry value. I've already implemented a custom extractor (extending IExtractor) that takes an input stream and produces a number of plain-text lines.

Initial load

Now I am trying to load all of that data into ADL Store. I created a query that looks similar to this:

@entries =
  EXTRACT
    repo string,
    branch string,
    build int,
    step int,
    Line int,
    Entry string
  FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{step}.json"
  USING new MyJSONExtractor();

When I run this extraction query I get a compiler error - it exceeds the limit of 25 minutes of compilation time. My guess is that there are simply too many files, so I add a WHERE clause to the INSERT INTO query:

INSERT INTO Entries
(Repo, Branch, Build, Step, Line, Entry)
SELECT * FROM @entries
WHERE (repo == "myRepo") AND (branch == "master");

Still no luck - the compiler times out.

(It does work, however, when I process a single build, leaving {step} as the only wildcard and hard-coding the rest of the names, roughly as in the sketch below.)
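
To illustrate, the single-build variant looks roughly like this (the repository and branch come from the WHERE clause above; the build number 42 is just a placeholder):

// Only {step} is left as a wildcard; repository, branch and build are hard-coded.
@entries =
  EXTRACT
    step int,
    Line int,
    Entry string
  FROM @"wasb://my.blob.core.windows.net/myRepo/master/42/{step}.json"
  USING new MyJSONExtractor();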

Question: Is there a way to perform a load like that as a number of smaller jobs - but without the need to explicitly (manually) "partition" the list of input files?

Incremental load

Let's assume for a moment that I succeeded in loading those files. However, a few days from now I'll need to perform an update - how am I supposed to specify the list of files then? I have a SQL Server database where all the metadata is kept, and I could extract the exact log file paths from it - but U-SQL's EXTRACT query forces me to provide a static string that specifies the input data.

A straightforward approach would be to define a top-level directory for each date and process the directories day by day. But the way the system is designed makes this very difficult, if not impossible.

Question: Is there a way to identify files by their creation time? Or is there perhaps a way to combine a query against a SQL Server database with the extraction query?


Solution

  • To address your second question:

    You could read your data from the SQL Server database with a federated query, and then use that information in a join against the virtual columns that you create from the fileset (see the first sketch below). The problem with that is that the values are only known at execution time and not at compile time, so you would not get the reduction in the number of files that are accessed.

    Alternatively, you could run a SQL query that gets you the data you need, and then parameterize your U-SQL script so you can pass that information into the script when you submit the job (see the second sketch below). Since parameter values are known at compile time, the fileset can then be reduced to just the relevant files.

    As to the ability to select files based on their creation time: this is a feature on our backlog. I would recommend upvoting the following feature request and adding a comment that you would also like to query on file properties over a fileset: https://feedback.azure.com/forums/327234-data-lake/suggestions/10948392-support-functionality-to-handle-file-properties-fr
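
    A rough sketch of the first approach, assuming the metadata database has already been registered in the U-SQL catalog as an external data source (called MyMetadataDb here) with a dbo.Builds table listing the builds to load - both names are placeholders for whatever you have set up:

    // Hypothetical sketch: MyMetadataDb and dbo.Builds stand in for your own
    // registered data source and metadata table.
    @builds =
      SELECT Repo, Branch, Build
      FROM EXTERNAL MyMetadataDb LOCATION "dbo.Builds";

    @entries =
      EXTRACT
        repo string,
        branch string,
        build int,
        step int,
        Line int,
        Entry string
      FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{step}.json"
      USING new MyJSONExtractor();

    // The join values are only known at execution time, so the fileset is not
    // reduced - all files are still enumerated and read, and filtered afterwards.
    @filtered =
      SELECT e.repo, e.branch, e.build, e.step, e.Line, e.Entry
      FROM @entries AS e
           INNER JOIN @builds AS b
           ON e.repo == b.Repo AND e.branch == b.Branch AND e.build == b.Build;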
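
    And a minimal sketch of the second approach: DECLARE EXTERNAL variables have default values that the submitting tool (PowerShell, the SDK, Azure Data Factory, ...) can override per job with the values returned by your SQL query. Because those values are known at compile time, the fileset only covers the requested build; the defaults shown are placeholders:

    // Hypothetical sketch: the defaults below would be overridden at job
    // submission time with the values coming from the metadata database.
    DECLARE EXTERNAL @repo string = "myRepo";
    DECLARE EXTERNAL @branch string = "master";
    DECLARE EXTERNAL @build string = "42";

    DECLARE @input string =
      "wasb://my.blob.core.windows.net/" + @repo + "/" + @branch + "/" + @build + "/{step}.json";

    @entries =
      EXTRACT
        step int,
        Line int,
        Entry string
      FROM @input
      USING new MyJSONExtractor();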