Search code examples
azure-data-factoryazure-data-lakeu-sql

Reading from specific files on U-SQL


I am new to U-SQL and have started querying files. Based on instructions that I have seen on documentation and here on stack overflow, I have a written a query for extracting metadata from a set of json files as below.

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

DECLARE @InputFile string = "/test/{*}.json";
DECLARE @OutputFile string = "/metadata.csv";

@json =
EXTRACT
        id string,
        date DateTime,
        type string,
        uri = FILE.URI()
      , modified_date = FILE.MODIFIED()
      , created_date = FILE.CREATED()
      , file_sz = FILE.LENGTH()
FROM
    @InputFile
USING new JsonExtractor();

@json2 =
    SELECT 
    uri
    modified_date,
    created_date,
    file_sz
FROM @json;

@json3 =
    SELECT DISTINCT uri,
    modified_date,
    created_date,
    file_sz

    FROM @json2;
OUTPUT @json3
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:true);
DROP ASSEMBLY [Newtonsoft.Json];
DROP ASSEMBLY [Microsoft.Analytics.Samples.Formats];

This generates the required metadata( I run this on Azure portal even though here the locations are relative)

My questions are:

1) How can we use values from a column(a column of filenames) from an internal table / file as a list of file paths for input?

2) How can we append new data to an existing file and update the metadata file with list of new files.

My metadata looks like this:

uri           created_date       modified_date   file_sz
/…/abc.json      09-22-2018        09-23-2018       250
/…/del.json      09-24-2018        09-24-2018       126

EXPECTED OUTPUT (if possible)

@filespresent =
SELECT uri
FROM @metadata

DECLARE @Inputfile string = @filespresent

The main purpose of doing this is I get a new batch of files everyday and I want to read only the files that were uploaded today.

The filename do not contain date and the only way I can extract the date information is from inside the file. I extract metadata in a separate query and then run the main job on the files selected from the metadata file.

If there are other workaround to this, they are also most welcome.

Any help is appreciated.


Solution

  • The features you would like (e.g., reading the paths from a file, and appending to an existing file) is not available.

    Appending to a file you could do by reading the file, union the new data and then writing the result into the same file.

    However, by looking at your scenario of **The main purpose of doing this is I get a new batch of files everyday and I want to read only the files that were uploaded today. **

    You can do your EXTRACT as above and then put a filter onto the created_date or modified_date column to only select the files that are created or modified for the specific date. E.g. (typed into stackoverflow),

    DECLARE EXTERNAL @last_processed_modified_date = DateTime.Now();
    
    @json = EXTRACT
        id string,
        date DateTime,
        type string,
        uri = FILE.URI()
      , modified_date = FILE.MODIFIED()
      , created_date = FILE.CREATED()
      , file_sz = FILE.LENGTH()
    FROM @InputFile
    USING new JsonExtractor();
    
    @json = SELECT * FROM @json WHERE modified_date > @last_processed_modified_date;
    
    …