Tags: json, azure, u-sql, azure-data-factory

How to standardize the output of USQL to have data for all the columns when converted from JSON



We have a requirement to standardize the output of U-SQL. The U-SQL script reads the JSON (source file) data and converts it to CSV format. The problem is that the number of columns per row in the CSV is not the same, because of missing data in the JSON. Sometimes the U-SQL result set has a row with "N" columns and another row with "N+1" columns (cells). We would like to standardize the output so that every row in the CSV has the same number of columns. How do we achieve this? We don't have any control over the source file data, so we would need to do the standardization while processing. Has anyone faced similar challenges and found a solution? Thanks for your help!

Input details :

{"map": {"key1": 100,    "key2": 101,    "key3": 102},  "id": 2,  "time": 1540300241230}
{"map": {"key1": 200,    "key2": 201,    "key3": 202,    "key4": 203},  "id": 2,  "time": 1540320246930}
{"map": {"key1": 300,    "key3": 301,    "key4": 303},  "id": 2,  "time": 1540350246930}

After the above JSON is converted to CSV (with some calculation applied):

Actual output, which is not correct:

key1, key2, key3, key4

100, 101, 102
200, 201, 202, 203
300, 301, 303

In the last row, the value "301" shifts into the key2 column even though it belongs to key3.

Expected output - # is the default for missing column values:

key1, key2, key3, key4

100, 101, 102,  #
200, 201, 202, 203
300, #, 301, 303

Later, all the headings (key1, key2, ...) will be replaced with the actual header names (Pressure, Velocity, etc.).
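Outside of U-SQL, the normalization itself can be sketched in a few lines. This is a minimal Python illustration of the desired behavior, assuming the column list is known ahead of time (as in the question) and using "#" as the placeholder; the function name is hypothetical:

```python
import json

COLUMNS = ["key1", "key2", "key3", "key4"]  # known column set
MISSING = "#"                               # placeholder for absent keys

def normalize(lines):
    """Turn JSON lines into CSV rows with a fixed column set,
    filling missing keys with the placeholder."""
    rows = [", ".join(COLUMNS)]
    for line in lines:
        record = json.loads(line)
        values = record.get("map", {})
        rows.append(", ".join(str(values.get(c, MISSING)) for c in COLUMNS))
    return rows

sample = [
    '{"map": {"key1": 100, "key2": 101, "key3": 102}, "id": 2, "time": 1540300241230}',
    '{"map": {"key1": 300, "key3": 301, "key4": 303}, "id": 2, "time": 1540350246930}',
]
for row in normalize(sample):
    print(row)
```

Looking up each expected column by name (rather than writing values positionally) is what keeps 301 in the key3 column.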


Solution

    USE DATABASE [ADLSDB];

    DECLARE EXTERNAL @INPUT_FILE string = "/adlspath/keyValue.txt";
    DECLARE EXTERNAL @PIVOT_FILE string = "/adlspath/pivot.txt";

    /* The metadata about the request starts - contents of the file request.json */

    @requestData =
        EXTRACT id int,
                timestamp string,
                key string,
                value int
        FROM @INPUT_FILE
        USING Extractors.Csv();

    @data =
        SELECT id AS id,
               timestamp AS timestamp,
               key AS key,
               value AS value
        FROM @requestData;

    DECLARE EXTERNAL @ids string = "key1,key2,key3,key4"; // external declaration

    @result =
        SELECT *
        FROM (SELECT id, timestamp, key, value FROM @data) AS D
        PIVOT(SUM(value) FOR key IN(@ids AS heading)) AS P;

    OUTPUT @result
    TO @PIVOT_FILE
    USING Outputters.Csv(quoting : false, outputHeader : false);

    I was able to get close to the solution with the code above, but I am stuck at passing multiple values to the IN clause. I will have the list of @ids at compile time of the U-SQL script, but passing it as a comma-separated scalar variable does not produce the result. If I pass only one value (say, key1), the IN condition matches and the rows for key1 are output. Does anyone know how to pass multiple values to the IN clause of the U-SQL PIVOT function?

    ------Updated------------

    We were able to solve the problem by using dynamic U-SQL: one U-SQL script writes the required U-SQL statements to an output file, and another Data Factory activity then runs the dynamically generated script.
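    The generation step can be sketched as follows: expand the comma-separated key list into individual literal values for the PIVOT IN clause, then emit the full statement as text for the second activity to run. This Python sketch is illustrative only, not the author's actual generator; the template mirrors the U-SQL shown above:

    ```python
    # Sketch: generate the U-SQL PIVOT statement with the key list expanded
    # into separate literal values, since a single comma-separated scalar
    # variable does not work inside PIVOT's IN clause.

    TEMPLATE = """\
    @result =
        SELECT *
        FROM (SELECT id, timestamp, key, value FROM @data) AS D
        PIVOT(SUM(value) FOR key IN({in_list}) AS heading) AS P;
    """

    def build_pivot_script(ids_csv):
        # "key1,key2" -> '"key1", "key2"'
        in_list = ", ".join('"{}"'.format(k.strip()) for k in ids_csv.split(","))
        return TEMPLATE.format(in_list=in_list)

    print(build_pivot_script("key1,key2,key3,key4"))
    ```

    The generated text is what the first U-SQL job writes out, and the follow-up Data Factory activity submits it as a new script.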