Search code examples
c#jsonazureazure-data-lakeu-sql

U-SQL Ignore rows causing JSON exception in JsonFunctions.JsonTuple() after using Text Extractor


I'm trying to convert JSON files into csv, extracting some specific key/values of interest. I have large text files (> 1 million lines) where each line is a JSON object. The actual structure is nested and contains arrays, but it shouldn't be important for this question.

Example file:

{"param1": val00, "param2": val01}
{"param1": val10, "param2": val11}
...
{"param1": valn0, "param2": valn1}

I'm using the following code with Text Extractor to convert each line to a JSON object, and subsequently parsing the JSON object using JsonFunctions.JsonTuple()

REFERENCE ASSEMBLY DdaAdlDb.[Newtonsoft.Json];
REFERENCE ASSEMBLY DdaAdlDb.[Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

@RawExtract = 
    EXTRACT RawString string
    FROM @InputFile 
    USING Extractors.Text(delimiter:'\b', quoting:false);

@json =
    SELECT JsonFunctions.JsonTuple(RawString, "..*") AS RootObject
    FROM @RawExtract;

I run into problems because at least one of the lines is corrupt, as follows:

{"param1": val00, "param2"{"param1": val10, "param2": val11}
...
{"param1": valn0, "param2": valn1}

These are the error messages I receive

VertexFailedFast: Vertex failed with a fail-fast error E_RUNTIME_USER_EXPRESSIONEVALUATION Error while evaluating expression Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(RawString, "..*") Inner exception from user expression: After parsing a value an unexpected character was encountered: {.

==== Caught exception Newtonsoft.Json.JsonReaderException at Newtonsoft.Json.JsonTextReader.ParsePostValue(Boolean ignoreComments) at Newtonsoft.Json.JsonTextReader.Read() at Newtonsoft.Json.Linq.JContainer.ReadContentFrom(JsonReader r, JsonLoadSettings settings) at Newtonsoft.Json.Linq.JContainer.ReadTokenFrom(JsonReader reader, JsonLoadSettings options) at Newtonsoft.Json.Linq.JObject.Load(JsonReader reader, JsonLoadSettings settings) at Newtonsoft.Json.Linq.JToken.ReadFrom(JsonReader reader, JsonLoadSettings settings) at Newtonsoft.Json.Linq.JToken.Parse(String json, JsonLoadSettings settings) at Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple[T](String json, String[] paths) at Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(String json, String[] paths) at _Scope_Generated_Classes_.SqlFilterTransformer_13.Process(IRow inRow, IUpdatableRow outRow) in d:\data\yarnnm\local\usercache\45905a29-60cf-4dd2-bf9c-c2b31bedea60\appcache\application_1546207842059_5159256\container_e346_1546207842059_5159256_01_000001\wd__ScopeCodeGen__.dll.cs:line 233

I don't mind losing a few lines. I see that for extractors I can use silent: true to ignore bad rows. Is there a similar option for the JsonFunctions.JsonTuple() parsing function?


Solution

  • It was possible to create a try-catch block within the U-SQL script to avoid crash on JsonReaderException.

    Try-catch block

    // Ignore rows with invalid JSON formattting
    DECLARE @TryParseJson Func<string, SqlMap<string, string>> = (RawString)=>{
        try 
        {
            return JsonFunctions.JsonTuple(RawString, "..*");
        }
        catch (Newtonsoft.Json.JsonReaderException e)
        {
            return null;  // Consider returning empty SqlMap<string, string> instead
        }
    };
    

    Extract and parse JSON

    Each line contains a JSON string

    @RawExtract = 
        EXTRACT RawString string
        FROM @InputFile 
        USING Extractors.Text(delimiter:'\b', quoting:false);
    
    @json =
        SELECT @TryParseJson(RawString) AS RootObject
        FROM @RawExtract;
    

    Ignore rows with invalid JSON (null value)

    @result =
        SELECT RootObject["id"] AS Id
             , RootObject["status"] AS Status
             , RootObject["time"] AS Time
        FROM @json
        WHERE RootObject != null;  // Ignore invalid JSON