I am new to U-SQL and have started querying files. Based on instructions that I have seen on documentation and here on stack overflow, I have a written a query for extracting metadata from a set of json files as below.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
DECLARE @InputFile string = "/test/{*}.json";
DECLARE @OutputFile string = "/metadata.csv";
@json =
EXTRACT
id string,
date DateTime,
type string,
uri = FILE.URI()
, modified_date = FILE.MODIFIED()
, created_date = FILE.CREATED()
, file_sz = FILE.LENGTH()
FROM
@InputFile
USING new JsonExtractor();
@json2 =
SELECT
uri
modified_date,
created_date,
file_sz
FROM @json;
@json3 =
SELECT DISTINCT uri,
modified_date,
created_date,
file_sz
FROM @json2;
OUTPUT @json3
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:true);
DROP ASSEMBLY [Newtonsoft.Json];
DROP ASSEMBLY [Microsoft.Analytics.Samples.Formats];
This generates the required metadata( I run this on Azure portal even though here the locations are relative)
My questions are:
1) How can we use values from a column(a column of filenames) from an internal table / file as a list of file paths for input?
2) How can we append new data to an existing file and update the metadata file with list of new files.
My metadata looks like this:
uri created_date modified_date file_sz
/…/abc.json 09-22-2018 09-23-2018 250
/…/del.json 09-24-2018 09-24-2018 126
EXPECTED OUTPUT (if possible)
@filespresent =
SELECT uri
FROM @metadata
DECLARE @Inputfile string = @filespresent
The main purpose of doing this is I get a new batch of files everyday and I want to read only the files that were uploaded today.
The filename do not contain date and the only way I can extract the date information is from inside the file. I extract metadata in a separate query and then run the main job on the files selected from the metadata file.
If there are other workaround to this, they are also most welcome.
Any help is appreciated.
The features you would like (e.g., reading the paths from a file, and appending to an existing file) is not available.
Appending to a file you could do by reading the file, union the new data and then writing the result into the same file.
However, by looking at your scenario of **The main purpose of doing this is I get a new batch of files everyday and I want to read only the files that were uploaded today. **
You can do your EXTRACT
as above and then put a filter onto the created_date
or modified_date
column to only select the files that are created or modified for the specific date. E.g. (typed into stackoverflow),
DECLARE EXTERNAL @last_processed_modified_date = DateTime.Now();
@json = EXTRACT
id string,
date DateTime,
type string,
uri = FILE.URI()
, modified_date = FILE.MODIFIED()
, created_date = FILE.CREATED()
, file_sz = FILE.LENGTH()
FROM @InputFile
USING new JsonExtractor();
@json = SELECT * FROM @json WHERE modified_date > @last_processed_modified_date;
…