Search code examples
pentahopentaho-spoonpentaho-data-integration

Interpolation of missing values in Pentaho


I am trying to fill down the missing values using Pentaho pdi.

Input:

enter image description here

Desired output:

enter image description here

Found so far only Filling data gaps in a stream in Pentaho Data Integration, is it possible? but it fills in with the last known value.

Potentially, I thought I could work with the above solution, I also added the next amount to the analytical query, along with the next date. Then, I added the flag in the clone step and filter the original results from the input into Dummy and generated results (from the calculator) to a calculator (at the moment). Then, potentially, I can dump that separate stream to a temp table in a database and run the sql query which will do the rolling subtraction. I am also investigating the javascript step.

I disregarded the Python or R Executor step because at the end I will be running the job on the aws vm and I already foresee the pain I will go through with the installation.

What would be your suggestions? Is there a simple way to do interpolation?

Updated for the question enter image description here

enter image description here


Solution

  • The method provided in your link does work from my testing, (I am using LAG instead of LEAD for your tasks though). Here I am not looking to replicate that method, just another option for you by using JavaScript to build the logic which you might also extend to other applications:

    In the testing below (tested on PDI-8.0), the transformation has 5 steps, see below

    enter image description here

    1. Data Grid step to create testing data with three fields: date, account number and amount
    2. Sort rows to sort the rows based on account number and date. this is required for Analytic Query step, if your source data are already sorted, then skip this step
    3. Analytic Query step, see below, create two more fields: prev_date and prev_amount

    enter image description here

    1. Modified Java Script Value step, add the following code, nothing else is needed to configure in this step:

      var days_diff = dateDiff(prev_date, date, "d")
      
      if (days_diff > 0) {
          /* retrieve index for two fields: 'date', 'amount' 
           * and modify their values accordingly
           */
          var idx_date = getInputRowMeta().indexOfValue("date")
          var idx_amount = getInputRowMeta().indexOfValue("amount")
      
          /* amount to increment by each row */
          var delta_amount = (amount - prev_amount)/days_diff
      
          for (var i = 1; i < days_diff; i++) {
              newRow = createRowCopy(getOutputRowMeta().size());
              newRow[idx_date]   = dateAdd(prev_date, "d", i);
              newRow[idx_amount] = prev_amount + delta_amount * i;
              putRow(newRow);
          } 
      }
      
    2. Select values step to remove unwanted fields, i.e.: prev_date, prev_amount

    Run the transformation, you will have the following shown under the Preview data tab of Modified Java Script Value step:

    enter image description here

    UPDATE:

    Per your comments, you can do the following, assume you have a new field account_type:

    1. in Analytic Query step, add a new field prev_account_type, similar to two other prev_ fields, just from different Subject: account_type

    2. in Modified Java Script Value step, you need to retrieve the Row index for account_type and modify the logic to compute delta_amount, so when prev_account_type is not the same as the current account_type, the delta_amount is ZERO, see below code:

      var days_diff = dateDiff(prev_date, date, "d")
      
      if (days_diff > 0) {
          /* retrieve index for three fields: 'date', 'amount', 'account_type' */
          var idx_date     = getInputRowMeta().indexOfValue("date")
          var idx_amount   = getInputRowMeta().indexOfValue("amount")
          var idx_act_type = getInputRowMeta().indexOfValue("account_type")
      
          /* amount to increment by each row */
          var delta_amount = prev_account_type.equals(account_type) ? (amount - prev_amount)/days_diff : 0;
      
          /* copy the current Row into newRow and modify fields accordingly */
          for (var i = 1; i < days_diff; i++) {
              newRow = createRowCopy(getOutputRowMeta().size());
              newRow[idx_date]     = dateAdd(prev_date, "d", i);
              newRow[idx_amount]   = prev_amount + delta_amount * i;
              newRow[idx_act_type] = prev_account_type;
              putRow(newRow);
          } 
      }
      

    Note: invoking Javascript interpreter does have some performance hit, so if that matters to you, stick to the method in the link you provided.