Search code examples
arraysjsonpentaho-spoonpentaho-data-integrationpdi

How to Read a multi nested JSON file with Pentaho Spoon 6 JSON Input


I have a JSON file with several sub-levels that I need to parse with PDI. The tricky bit that is throwing me for a loop is the one to many relationship in the nested array.

Here is a sample of my JSON:

{
    "@gdata.count": "139111",
    "value": [
        {
            "InvStatus": {
                "Description": "Active"
            },
            "DeviceAddresses": [],
            "People": [],
            "Id": "11",
            "InvHostName": "NYCRMR-TWX9000",
            "InvDomain": "blah.masked.com",
            "InvSerialNumber": "86753091",
            "InvDevType": "Workstation",
            "InvRegion": "NA",
            "InvLocation": "114AP2",
            "InvNetwork": "WMG",
            "InvBrand": "RMR",
            "InvProtectionStatus": null,
            "InvConversionStatus": null,
            "InvDeviceDob": "2023-10-23"
        }, {
            "InvStatus": {
                "Description": "Disconnected"
            },
            "DeviceAddresses": [],
            "People": [],
            "Id": "37",
            "InvHostName": "NYCRMR-TWX9002",
            "InvDomain": "blah.masked.com",
            "InvSerialNumber": "86753092",
            "InvDevType": "Workstation",
            "InvRegion": "NA",
            "InvLocation": "114AP4",
            "InvNetwork": "WMG",
            "InvBrand": "RMR",
            "InvProtectionStatus": null,
            "InvConversionStatus": null,
            "InvDeviceDob": "2023-10-23"
        }, {
            "InvStatus": {
                "Description": "Disconnected"
            },
            "DeviceAddresses": [],
            "People": [{
                    "Role": {
                        "Description": "Asset Manager"
                    },
                    "Id": "65571",
                    "Email": "[email protected]"
                }, {
                    "Role": {
                        "Description": "Primary Technical Contact"
                    },
                    "Id": "65477",
                    "Email": "[email protected]"
                }, {
                    "Role": {
                        "Description": "Asset User"
                    },
                    "Id": "65478",
                    "Email": "[email protected]"
                }, {
                    "Role": {
                        "Description": "Secondary Technical Contact"
                    },
                    "Id": "65479",
                    "Email": "[email protected]"
                }, {
                    "Role": {
                        "Description": "Secondary Technical Contact"
                    },
                    "Id": "65475",
                    "Email": "[email protected]"
                }
            ],
            "Id": "44",
            "InvHostName": "-0cc05ac548317d30c",
            "InvDomain": "WORKGROUP",
            "InvSerialNumber": "n/a",
            "InvDevType": "Server",
            "InvRegion": "NA",
            "InvLocation": "FREM45",
            "InvNetwork": "WMB",
            "InvBrand": "OKS",
            "InvProtectionStatus": null,
            "InvConversionStatus": null,
            "InvDeviceDob": "2021-05-03"
        }
    ]
}

I have attempted doing it all in one step and that only returned to me the first item in the nested results.

Next I attempted to do it in multiple steps where the first step retrieved the first level of fields, the 2nd the second level, and so on.

My first step I am retrieving these fields: PDI Screenshot 1

My second step I am retrieving these fields: PDI Screenshot 2

What ends up happening is the People fields from the 2nd step duplicate to all of the rows retrieved in the 1st step rather than just causing a duplication to the 3rd record from the 1st step.

Above JSON is just a small sample, my actual records that I'll be pulling back is over 100k and I only want to have the People records associated with the Devices.


Solution

  • I was able to figure this out and wanted to let anyone else looking for the solution know how I solved it.

    The second methodology was the correct path, where I would parse out the first level of fields in the first step and then in the second step I would parse out the second level of fields.

    The added twist I needed to do was pull out the first level's array of the secondary array into an additional field, in my example I called it PeopleData. This put just that small JSON array into its own field that was directly tied to that row's data and not every other row.

    See screenshot:

    PDI Screenshot 1

    Then in my next step I simply referenced that new field JSON and parsed that data out using [*] and because this was unique to that row only it kept the relationships to the top level intact.

    See screenshot:

    PDI Screenshot 2

    I hope this helps someone.