I'm trying to convert JSON files into csv, extracting some specific key/values of interest. I have large text files (> 1 million lines) where each line is a JSON object. The actual structure is nested and contains arrays, but it shouldn't be important for this question.
Example file:
{"param1": val00, "param2": val01}
{"param1": val10, "param2": val11}
...
{"param1": valn0, "param2": valn1}
I'm using the following code with Text Extractor to convert each line to a JSON object, and subsequently parsing the JSON object using JsonFunctions.JsonTuple()
REFERENCE ASSEMBLY DdaAdlDb.[Newtonsoft.Json];
REFERENCE ASSEMBLY DdaAdlDb.[Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
@RawExtract =
EXTRACT RawString string
FROM @InputFile
USING Extractors.Text(delimiter:'\b', quoting:false);
@json =
SELECT JsonFunctions.JsonTuple(RawString, "..*") AS RootObject
FROM @RawExtract;
I run into problems because at least one of the lines is corrupt, as follows:
{"param1": val00, "param2"{"param1": val10, "param2": val11}
...
{"param1": valn0, "param2": valn1}
These are the error messages I receive
VertexFailedFast: Vertex failed with a fail-fast error E_RUNTIME_USER_EXPRESSIONEVALUATION Error while evaluating expression Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(RawString, "..*") Inner exception from user expression: After parsing a value an unexpected character was encountered: {.
==== Caught exception Newtonsoft.Json.JsonReaderException at Newtonsoft.Json.JsonTextReader.ParsePostValue(Boolean ignoreComments) at Newtonsoft.Json.JsonTextReader.Read() at Newtonsoft.Json.Linq.JContainer.ReadContentFrom(JsonReader r, JsonLoadSettings settings) at Newtonsoft.Json.Linq.JContainer.ReadTokenFrom(JsonReader reader, JsonLoadSettings options) at Newtonsoft.Json.Linq.JObject.Load(JsonReader reader, JsonLoadSettings settings) at Newtonsoft.Json.Linq.JToken.ReadFrom(JsonReader reader, JsonLoadSettings settings) at Newtonsoft.Json.Linq.JToken.Parse(String json, JsonLoadSettings settings) at Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple[T](String json, String[] paths) at Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(String json, String[] paths) at _Scope_Generated_Classes_.SqlFilterTransformer_13.Process(IRow inRow, IUpdatableRow outRow) in d:\data\yarnnm\local\usercache\45905a29-60cf-4dd2-bf9c-c2b31bedea60\appcache\application_1546207842059_5159256\container_e346_1546207842059_5159256_01_000001\wd__ScopeCodeGen__.dll.cs:line 233
I don't mind losing a few lines. I see that for extractors I can use silent: true
to ignore bad rows. Is there a similar option for the JsonFunctions.JsonTuple()
parsing function?
It was possible to create a try-catch block within the U-SQL script to avoid crash on JsonReaderException
.
Try-catch block
// Ignore rows with invalid JSON formattting
DECLARE @TryParseJson Func<string, SqlMap<string, string>> = (RawString)=>{
try
{
return JsonFunctions.JsonTuple(RawString, "..*");
}
catch (Newtonsoft.Json.JsonReaderException e)
{
return null; // Consider returning empty SqlMap<string, string> instead
}
};
Extract and parse JSON
Each line contains a JSON string
@RawExtract =
EXTRACT RawString string
FROM @InputFile
USING Extractors.Text(delimiter:'\b', quoting:false);
@json =
SELECT @TryParseJson(RawString) AS RootObject
FROM @RawExtract;
Ignore rows with invalid JSON (null value)
@result =
SELECT RootObject["id"] AS Id
, RootObject["status"] AS Status
, RootObject["time"] AS Time
FROM @json
WHERE RootObject != null; // Ignore invalid JSON