Search code examples
pentahoetlkettle

Pentaho Kettle Group by with type concatenate produces duplicates


I have following Excel file:

Excel sheet screenshot

My end result should looks like (example of output file for first entry):

<?xml version="1.0" encoding="utf-8"?>
<asset>
    <asset_feature key="ItemID" value="12000"/>
    <asset_feature key="Name" value="Name of the first item"/>
    <asset_feature key="Price" value="1020.0"/>
</asset>

In order to achieve it on initial steps I use Row Normaliser step and after it Group by with type Concatenate strings separated by ,. I am expecting to have this:

<asset_feature key="ItemID" value="12000"/>, <asset_feature key="Name" value="Name of the first item"/>, <asset_feature key="Price" value="1020.0"/>

and use technic from XML Add - creating multi level XML files example for >, < replacement, as: var xmlOutput=xmlOrderListNew.replace(">, <","><");, but instead I get following result back (3 duplicated rows):

<asset_feature key="ItemID" value="12000"/>, <asset_feature key="Name" value="Name of the first item"/>, <asset_feature key="Price" value="1020.0"/>
<asset_feature key="ItemID" value="12000"/>, <asset_feature key="Name" value="Name of the first item"/>, <asset_feature key="Price" value="1020.0"/>
<asset_feature key="ItemID" value="12000"/>, <asset_feature key="Name" value="Name of the first item"/>, <asset_feature key="Price" value="1020.0"/>

so, to fix it I have to put extra transformation step for getting just Unique rows. Transformation picture:

Transformation flow

Question is: how to fix Group by step to don't get duplicates without extra Unique rows step, as XML Add - creating multi level XML files doesn't have such issue?

Optional question 1: is there any simple way for concatenation in order to avoid replace(">, <","><") step later?

Optional question 2: How to make pretty XML (with indentation) formatting?

Example Excel & transformation are here

Thank you in advice for your help.


Solution

  • Question 0 (Avoid duplicate in group by): Un-check the Include all row check box!

    Question 1 (Avoid "," in concatenation): Use a Concatenate string separated by (without coma) and do not put anything as value.

    Question 2 (Petty print xml/html): You can put any function in a javascript step. In particular you can copy a xml pretty printer, for example formatXML.js

    Un-asked question: Use Memory Group by rather than Group by step, unless you have really really huge data.