Tags: json, azure-data-factory, azure-data-lake, azure-data-lake-gen2

Azure Data Factory - extracting information from Data Lake Gen 2 JSON files


I have an ADF pipeline loading raw log data as JSON files into a Data Lake Gen 2 container.

We now want to extract information from those JSON files, and I am trying to find the best way to do it. I found that Azure Data Lake Analytics and U-SQL scripts are quite powerful and also cheap, but they come with a steep learning curve.

Is there a recommended way to parse the JSON files and extract information from them? Would Data Lake tables be adequate storage for the extracted information, which would then act as a source for a downstream reporting process?

And finally, will Azure Data Factory ever be able to parse JSON files with nested arrays?


Solution

  • We can parse JSON files and extract information with a mapping data flow. JSON files with nested arrays can be handled with the Flatten transformation in a mapping data flow.

    JSON example:

        {
            "count": 1,
            "value": [{
                "obj": 123,
                "lists": [{
                    "employees": [{
                        "name": "",
                        "id": "001",
                        "tt_1": 0,
                        "tt_2": 4,
                        "tt3_": 1
                    },
                    {
                        "name": "",
                        "id": "002",
                        "tt_1": 10,
                        "tt_2": 8,
                        "tt3_": 1
                    }]
                }]
            }]
        }
    

    Flatten transformation settings and output data preview (screenshots omitted):
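    As a rough sketch of what that output preview would contain, assuming the Flatten transformation unrolls by value.lists.employees and maps the employee leaf fields together with the top-level count and obj columns, each employee becomes one flat row:

        [
            { "count": 1, "obj": 123, "name": "", "id": "001", "tt_1": 0,  "tt_2": 4, "tt3_": 1 },
            { "count": 1, "obj": 123, "name": "", "id": "002", "tt_1": 10, "tt_2": 8, "tt3_": 1 }
        ]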

    Mapping data flows follow an extract, load, and transform (ELT) approach and work with staging datasets that are all in Azure; only certain dataset types, Azure Data Lake Storage Gen2 among them, can be used in a source transformation.
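    For reference, the source for such a data flow can simply be a JSON dataset pointing at the Data Lake Gen 2 container the pipeline already loads the raw logs into. A minimal sketch, where the dataset, linked service, file system and folder path names are hypothetical placeholders:

        {
            "name": "RawLogsJson",
            "properties": {
                "type": "Json",
                "linkedServiceName": {
                    "referenceName": "AdlsGen2LinkedService",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "location": {
                        "type": "AzureBlobFSLocation",
                        "fileSystem": "raw-logs",
                        "folderPath": "logs"
                    }
                }
            }
        }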

    So I think using a mapping data flow in ADF is the easiest way to extract the information, and its output can then act as a source for a downstream reporting process.
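    To wire this into the existing pipeline, the data flow runs as an Execute Data Flow activity after the raw logs have been copied into the lake. A minimal sketch, with hypothetical pipeline, activity and data flow names:

        {
            "name": "ProcessRawLogs",
            "properties": {
                "activities": [
                    {
                        "name": "FlattenRawLogs",
                        "type": "ExecuteDataFlow",
                        "typeProperties": {
                            "dataFlow": {
                                "referenceName": "RawLogsDataFlow",
                                "type": "DataFlowReference"
                            }
                        }
                    }
                ]
            }
        }

    The data flow's sink can then write the flattened rows to whatever the downstream reporting process reads from, for example another folder in the lake or an Azure SQL Database table.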