Search code examples
azureazure-data-lakeu-sql

How do you define a date range and then query across a complex file set in USQL?


There are many great examples of USQL across single files. But, how would you replicate a very common data processing example where you want to take the current system time, subtract X number of days from that time and query a set of data based on that result? For a SQL example:

SELECT * FROM MyTable
WHERE Date >= CAST(GETDATE() AS DATE) - 30
AND Date <= CAST(GETDATE() AS DATE) - 1

In this above example, my Dates is my file location such as:

'yyyy' | 'MM' | 'DD' | Filename.csv

-- Example path
/MyDirectory/2017/12/01/SomeData.csv

Therefore, is there a way in USQL with Azure Data Lake Analytics to do similar, but with the file location instead of querying everything with "{date:yyyy}/{date:MM}/{date:dd}/" expressions?

If that's not possible, what about specifing a range at least like:

"/MyDirectory/2017/{10-12}/{1-30}/{filename:*}.csv"

I can combine all files to one directory and use the natural date fields in the data to filter with a SELECT statement after the extractor, but the point of the directory structure is to reduce un-needed reads (transactions) and only targeting specific directories needed for a query based on the date of the said file itself.


Solution

  • Maya is correct. There are examples at U-SQL Language Reference and more specifically at EXTRACT Expression (U-SQL). See the example under "Multiple directories with multiple files". Here are some modifications to that example that appear to satisfy your ask.

    1) The example is missing DECLARE @dir string = "/Samples/Data/AmbulanceData/";
    2) Modify the DECLARE @file_set_path2 to read DECLARE @file_set_path2 string = @dir + "{date:yyyy}/{date:MM}/{date:dd}/vehicle{vid}_{*}.csv";
    3. For your filter you could use WHERE date >= DateTime.Now.AddDays(-30) AND date <= DateTime.Now.AddDays(-1)