Search code examples
pentahokettle

Filling data gaps in a stream in Pentaho Data Integration, is it possible?


I have a CSV file with currency exchanges EUR-USD. The file was downloaded from the Bank of Canada. I downloaded the CSV with data since Oct 10th, 2013 onwards.

There are, nevertheless, gaps in the data, ie. days without the conversion rates.

I've been fighting (1st day with Spoon Kettle) to find out a simple (but general) way to fill the gaps, say, with the last non-null value. And the only way I've managed to accomplish this is by chaining 4 "Get previous row fields" and the using the NVL in a Calculator to take the first non-null value. But that only works if gaps are not bigger than 4 rows in a stream.

The image represents the transformation:

Painfully filling the gaps

My first question reduces to: Is there a general way to do interpolation/extrapolation in a stream with gaps?

I tried to use the "Modified JavaScript Value" but the API still escapes me. Moreover, it seems that this step only have the Map part of a MapReduce combo, I'd probably need both.

So, my second question is: Is there a way to program a MapReduce combo in a language that is not Java (Scala, Clojure, Jython or JS)?


Solution

  • You can use a combination of the following three steps:

    1) Analytical query - allows you to fetch the value of a field N rows before or after the current row; In your case, you will want to fetch the date 1 row ahead (the next available date)

    2) Calculator - having determined the previous date for the row, use it to calculate Days between dates;

    3) Calculate a field number_of_clones as dbd-1 (the number of days missing;

    4) Use that field on the Clone Rows step to multiple a row as many times as necessary; Add a clone_number field

    5) Add the clone_number as days to the date and you get the day it refers to.

    Moreover, the Analytical query step allows you to specify a field as the "group by" field, so that if you have x-rates for USD and then you have x-rates for GBP, the final USD x-rate day will retrieve null as the next value.

    Here's a sample KTR file:

    Transformation to fill in the gaps

    The data grid step generates a few rows with some data gaps in there:

    some days are missing

    The Analytical query fetches the next date, for the same currency value

    enter image description here

    Then the calculator step calculates how many rows are missing. Note that the last day of each currency will have null as value, so we need to tweak that and use 0 instead (NVL(A,B) returns B if A is null, A otherwise) enter image description here

    Clone rows: takes a row and creates copies. enter image description here

    The clone_number field allows us to calculate the actual date the row refers to enter image description here

    Finally, here's the data. The fields you want are the new_date, currency and exchange_rate. Use a select values to re-order the field list and get rid of those you don't need anymore. enter image description here

    As you can see, now we have data for 2014-01-03 and 2014-01-04, using the previous known value.