Search code examples
u-sql

MultiLevelJsonExtractor - Extract the desired level


I have a JSON document that looks as follows:

{
  "Region": "Main",
  "MarketLocations": [
    {
      "MarketName": "Central",
      "MarketId": 1,
      "SalesCategories": {
        "Produce": [
          {
            "Type": "Apple",
            "Name": "Granny Smith",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 24,
            "Calories": 45,
            "Price": 0.29
          }
        ],
        "BakedGoods": [
          {
            "DateMade": "2016-11-08T14:14:33.712Z",
            "Name": "Apple Pie",
            "Price": 14.25
          }
        ],
        "RestaurantItems": [
          {
            "Name": "Turkey Sandwich",
            "Price": 4.85,
            "PreparationTimeInMinutes": 20
          }
        ],
        "NonPerishable": [
          {
            "Name": "Honey Mustard",
            "Type": "Condiments"
          }
        ]
      }
    },
    {
      "MarketName": "Southern",
      "MarketId": 2,
      "SalesCategories": {
        "Produce": [
          {
            "Type": "Apple",
            "Name": "Granny Smith",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 24,
            "Calories": 45,
            "Price": 0.29
          },
          {
            "Type": "Plums",
            "Name": "Red Plums",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 12,
            "Calories": 21,
            "Price": 0.33
          },
          {
            "Type": "Pears",
            "Name": "Golden Nature",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 20,
            "Calories": 40,
            "Price": 0.45
          }
        ],
        "BakedGoods": [
          {
            "DateMade": "2016-11-08T14:14:33.712Z",
            "Name": "Apple Pie",
            "Price": 14.25
          }
        ],
        "RestaurantItems": [
          {
            "Name": "Turkey Sandwich",
            "Price": 4.85,
            "PreparationTimeInMinutes": 20
          }
        ],
        "NonPerishable": [
          {
            "Name": "Honey Mustard",
            "Type": "Condiments"
          }
        ]
      }
    },
    {
      "MarketName": "Western",
      "MarketId": 3,
      "SalesCategories": {
        "Produce": [
          {
            "Type": "Plums",
            "Name": "Red Plums",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 12,
            "Calories": 21,
            "Price": 0.33
          },
          {
            "Type": "Pears",
            "Name": "Golden Nature",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 20,
            "Calories": 40,
            "Price": 0.45
          }
        ],
        "BakedGoods": [
          {
            "DateMade": "2016-11-08T14:14:33.712Z",
            "Name": "Plum Pie",
            "Price": 18.25
          }
        ],
        "RestaurantItems": [
          {
            "Name": "Ham Sandwich",
            "Price": 4.85,
            "PreparationTimeInMinutes": 20
          },
          {
            "Name": "Chicken Soup",
            "Price": 2.25,
            "PreparationTimeInMinutes": 5
          }
        ],
        "NonPerishable": [
          {
            "Name": "Mayo",
            "Type": "Condiments"
          },
          {
            "Name": "Syrup",
            "Type": "Condiments"
          },
          {
            "Name": "Ginger",
            "Type": "Spices"
          }
        ]
      }
    }
  ]
}

I have the following U-SQL, that processes this JSON file, running inside Visual Studio:

DECLARE @in string=@"/JsonDoc2.json";
DECLARE @out string=@"Output/JsonDoc2.csv";

    @produce =
        EXTRACT Name string,
                DatePicked DateTime,
                ShelfLifeInDays int,
                Calories int,
                Price decimal,
                MarketId string,
                MarketName string
        FROM @in
        USING new MultiLevelJsonExtractor("MarketLocations[*].SalesCategories.Produce[*]",
              false,
              "Name",
              "DatePicked",
              "ShelfLifeInDays",
              "Calories",
              "Price",
              "MarketId",
              "MarketName");


    OUTPUT @produce
    TO @out
    USING Outputters.Csv(outputHeader : true);

This executes without error. The problem is that I am specifically specifying what sales category I want ('produce'). I'd like to change this query so that that all sales categories are included (produce, baked goods etc.) with the category name included. I've not been able to figure out a way to do this.


Solution

  • The JsonType method of the NewtonSoft JsonFunctions class, returns a MAP value which is a key-value pair. You can then reference the key to get the JSON property / object / array names, at least after a few other manipulations with CROSS APPLY and EXPLODE.

    For your example, I got the following to work:

    REFERENCE ASSEMBLY [Newtonsoft.Json];
    REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
    
    USING Microsoft.Analytics.Samples.Formats.Json;
    
    DECLARE @input string = @"/input/myinputfile.json";
    DECLARE @output string = @"output/output.csv";
    
    @json =
        EXTRACT Region string,
                MarketName string,
                SalesCategories string // get the SalesCategories as JSON
        FROM @input
        USING new MultiLevelJsonExtractor("MarketLocations[*].SalesCategories",
              true,
              "Region",
              "MarketName",
              "SalesCategories"
              );
    
    
    // Convert the json string to tuple/MAP
    @working =
        SELECT Region,
               MarketName,
               JsonFunctions.JsonTuple(SalesCategories) AS x
        FROM @json;
    
    
    // Explode the tuple as key-value pair;
    @working =
        SELECT Region,
               MarketName,
               key,
               value
        FROM @working
             CROSS APPLY
                 EXPLODE(x) AS y(key, value);
    
    
    // Explode the value which is JSON
    @working =
        SELECT Region,
               MarketName,
               key,
               JsonFunctions.JsonTuple(y) AS z
        FROM @working
            CROSS APPLY
                 EXPLODE(JsonFunctions.JsonTuple(value).Values) AS x(y);
    
    
    // Prep the result, naming the items you want
    @result =
        SELECT Region,
               MarketName,
               key,
               z["Type"] AS Type,
               z["Name"] AS Name,
               z["DatePicked"] AS DatePicked,
               z["ShelfLifeInDays"] AS ShelfLifeInDays,
               z["Calories"] AS Calories,
               z["Price"] AS Price,
               z["DateMade"] AS DateMade,
               z["PreparationTimeInMinutes"] AS PreparationTimeInMinutes
        FROM @working;
    
    
    OUTPUT @result
    TO @output
    USING Outputters.Csv(quoting:false);
    

    My results: Results

    It feels like it could be simplified, but see how you get on. The samples for shredding JSON are in short supply but try here and here.