c# azure-data-lake u-sql

U-SQL custom extractor on custom row delimiter and json


I have several text files with the following data structure:

{
huge 
json 
block that spans across multiple lines
}
--#newjson#--
{
huge 
json 
block that spans across multiple lines
}
--#newjson#--
{
huge 
json 
block that spans across multiple lines
} etc....

So the file consists of JSON blocks separated by the row delimiter "--#newjson#--". I am trying to write a custom extractor to parse it. The problem is that I can't use the string data type to feed the JSON deserializer, because it has a maximum size of 128 KB and the JSON blocks do not fit in that. What is the best approach to parse this file using a custom extractor?

I have tried using the code below, but it doesn't work. Even the row delimiter "--#newjson#--" doesn't seem to work right.

public SampleExtractor(Encoding encoding, string row_delim = "--#newjson#--", char col_delim = ';')
{
    this._encoding = ((encoding == null) ? Encoding.UTF8 : encoding);
    this._row_delim = this._encoding.GetBytes(row_delim);
    this._col_delim = col_delim;
}

public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
{ 
    //Read the input  by json
    foreach (Stream current in input.Split(_encoding.GetBytes("--#newjson#--")))
    {
        var serializer = new JsonSerializer();

        using (var sr = new StreamReader(current))
        using (var jsonTextReader = new JsonTextReader(sr))
        {
            var jsonrow = serializer.Deserialize<JsonRow>(jsonTextReader); 
            output.Set(0, jsonrow.status.timestamp);
        }
        yield return output.AsReadOnly();
    }
} 

Solution

  • Here is one way to solve this:

    1) Create a C# equivalent of your JSON object. Note: this assumes all the JSON objects in your text file have the same structure. For example:

    JSON code

    {
        "id": 1,
        "value": "hello",
        "another_value": "world",
        "value_obj": {
            "name": "obj1"
        },
        "value_list": [
            1,
            2,
            3
        ]
    }

    C# equivalent

    using System.Collections.Generic;

    // Nested object under "value_obj"
    public class ValueObj
    {
        public string name { get; set; }
    }

    // Top-level object; property names match the JSON keys
    public class RootObject
    {
        public int id { get; set; }
        public string value { get; set; }
        public string another_value { get; set; }
        public ValueObj value_obj { get; set; }
        public List<int> value_list { get; set; }
    }
    

    2) After splitting the input on the delimiter, change your deserialization code as shown below:

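    using (JsonReader reader = new JsonTextReader(sr))
    {
        // Each split segment holds a single JSON object, so deserialize it
        // once into the class from step 1 rather than into a List<...>.
        RootObject o = serializer.Deserialize<RootObject>(reader);
    }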

    This deserializes the JSON data into a C# class object, which should serve your purpose. You can then serialize it again, or write it out as text or to any other file.
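
If input.Split keeps misbehaving with the custom delimiter, one alternative is to split the stream yourself, line by line. Below is a minimal sketch of that idea, not a tested extractor: JsonBlockSplitter and its Split method are hypothetical names, and it assumes the delimiter always sits on a line of its own, as in the sample file. It uses only the .NET base class library, so it can be tried outside U-SQL first.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper (not part of U-SQL or Json.NET).
public static class JsonBlockSplitter
{
    // Yields the text found between delimiter lines, one JSON block at a time.
    // Assumption: the delimiter occupies a whole line by itself.
    public static IEnumerable<string> Split(
        Stream input,
        string delimiter = "--#newjson#--",
        Encoding encoding = null)
    {
        var block = new StringBuilder();
        using (var reader = new StreamReader(input, encoding ?? Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Trim() == delimiter)
                {
                    // Delimiter reached: emit the finished block, skipping empty ones.
                    string json = block.ToString().Trim();
                    block.Clear();
                    if (json.Length > 0)
                    {
                        yield return json;
                    }
                }
                else
                {
                    block.AppendLine(line);
                }
            }
        }

        // The last block in the file has no trailing delimiter.
        string last = block.ToString().Trim();
        if (last.Length > 0)
        {
            yield return last;
        }
    }
}
```

Inside Extract you would presumably call JsonBlockSplitter.Split(input.BaseStream) and pass each block to JsonConvert.DeserializeObject<RootObject>(block). As far as I know, the 128 KB limit applies to U-SQL string column values rather than to ordinary C# strings inside the extractor, so holding one block in a string should be fine as long as only the small extracted fields are written to the output row. Because the blocks span multiple lines, the extractor class would most likely also need [SqlUserDefinedExtractor(AtomicFileProcessing = true)] so that the file is not divided in the middle of a block.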

    Hope it helps.