json, mongodb, pentaho, kettle, aws-documentdb

Pentaho Kettle: How to dynamically fetch JSON file columns


Background: I work for a company that basically sells passes. Every order placed by a customer can contain any number (N) of passes.

Issue: I have JSON event-transaction files arriving daily in an S3 bucket from DocumentDB (MongoDB). Each JSON file records the relevant event type (insert, modify, or delete) for a document key (an order, in my case). The example below illustrates an "insert" event that came through to the S3 bucket:

{
  "_id": {
    "_data": "11111111111111"
  },
  "operationType": "insert",
  "clusterTime": {
    "$timestamp": {
      "t": 11111111,
      "i": 1
    }
  },
  "ns": {
    "db": "abc",
    "coll": "abc"
  },
  "documentKey": {
    "_id": {
      "$uuid": "abcabcabcabcabcabc"
    }
  },
  "fullDocument": {
    "_id": {
      "$uuid": "abcabcabcabcabcabc"
    },
    "orderNumber": "1234567",
    "externalOrderId": "12345678",
    "orderDateTime": "2020-09-11T08:06:26Z[UTC]",
    "attraction": "abc",
    "entryDate": {
      "$date": "2020-09-13"
    },
    "entryTime": {
      "$date": "04000000"
    },
    "requestId": "abc",
    "ticketUrl": "abc",
    "tickets": [
      {
        "passId": "1111111",
        "externalTicketId": "1234567"
      },
      {
        "passId": "222222222",
        "externalTicketId": "122442492"
      }
    ],
    "_class": "abc"
  }
}

As we can see above, every JSON file might contain N passes, and every pass is in turn associated with an external ticket id, which is a separate column (as seen above). I want to use Pentaho Kettle to read these JSON files and load the data into the DW. I am aware of the Json input step and the Row Normalizer step, which could transpose the "PassID 1", "PassID 2", "PassID 3" ... "PassID N" columns into a single "Pass" column, and I would have to apply similar logic to the "External ticket id" column. The problem with that approach is that it is quite static: I have to "tell" Pentaho in advance, in the Json input step, how many passes are coming. But what if tomorrow an order arrives with 10 different passes? How can I do this dynamically to ensure the job will not break?


Solution

  • If you want a tabular output like

    TicketUrl   Pass             ExternalTicketID
    ----------  ---------------  ----------------
    abc         PassID1Value1    ExTicketIDvalue1
    abc         PassID1Value2    ExTicketIDvalue2
    abc         PassID1Value3    ExTicketIDvalue3
    
    

    and you want the incoming values to be driven dynamically by the JSON input file, then you can download this transformation: Updated Link

    In my testing, everything in the Json input step behaved dynamically.
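    What keeps the step dynamic is addressing the array with a `[*]` wildcard in the Json input field paths (e.g. `$.fullDocument.tickets[*].passId` and `$.fullDocument.tickets[*].externalTicketId`, assuming the document structure from the question), so every ticket becomes its own row regardless of how many the order contains. A minimal Python sketch of the same flattening, using the sample document's field names:

    ```python
    import json

    # Sample change event trimmed to the fields we flatten; field
    # names match the example document in the question.
    event = json.loads("""
    {
      "fullDocument": {
        "ticketUrl": "abc",
        "tickets": [
          {"passId": "1111111", "externalTicketId": "1234567"},
          {"passId": "222222222", "externalTicketId": "122442492"}
        ]
      }
    }
    """)

    def flatten_tickets(event):
        """Emit one row per ticket, however many the order contains --
        the same effect as wildcard [*] paths in the Json input step."""
        doc = event["fullDocument"]
        return [
            {
                "TicketUrl": doc["ticketUrl"],
                "Pass": t["passId"],
                "ExternalTicketID": t["externalTicketId"],
            }
            for t in doc["tickets"]
        ]

    for row in flatten_tickets(event):
        print(row)
    ```

    With wildcard paths, an order with 2 passes yields 2 rows and one with 10 yields 10 rows, without any change to the transformation.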