Search code examples
pentahoetlkettle

Pentaho to convert tree structure data


I have a stream of data from a CSV. It is a flat structured database.

E.g.:

a,b,c,d
a,b,c,e
a,b,f

This essentially transforms into:

 Node id,Nodename,parent id,level
 100, a , 0 , 1
 200, b , 100 , 2
 300, c , 200 , 3
 400, d , 300 , 4
 500, e , 300 , 4
 600, f , 200 , 3

Can this be done using Pentaho? I have gone through the transformation steps. But nothing strikes me as usable for this purpose. Please let me know if there is any step that I may have missed.


Solution

  • Your CSV file contains graph or tree definition. The output format is rich (node_id needs to be generated, parent_id needs to be resolved, level needs to be set). There are few issues you will face when processing this kind of CSV file in Pentaho Data Integration:

    Data loading & processing:

    • Rows do not have same length (sometimes 4 nodes, sometimes 3 node).
    • Load whole rows. And then split rows to nodes and process one node per record stream item.
    • You can calculate output values in the same step as where the nodes are split.

    Solution Steps:

    1. CSV file input: Load data from CSV. Settings: No header row; Delimiter = ';'; One output column named rowData
    2. Modified Java Script Value: Split rowData to nodes and calculate output values: nodeId, nodeName, parentId, nodeLevel [See the code below]
    3. Sort rows: Sort rows by nodeName. [a,b,c,d,a,b,c,e,a,b,f >> a,a,a,b,b,c,c,d,e,f]
    4. Unique rows: Delete duplicate rows by nodeName. [a,a,a,b,b,c,c,d,e,f >> a,b,c,d,e,f]
    5. Text file output: Write out results.

    Modified Java Script Value Code:

    function writeRow(nodeId, nodeName, parentId, nodeLevel){
        newRow = createRowCopy(getOutputRowMeta().size());
        var rowIndex = getInputRowMeta().size();
    
        newRow[rowIndex++] = nodeId;
        newRow[rowIndex++] = nodeName;
        newRow[rowIndex++] = parentId;
        newRow[rowIndex++] = nodeLevel;
    
        putRow(newRow);
    }
    
    var nodeIdsMap = {
        a: "100",
        b: "200",
        c: "300",
        d: "400",
        e: "500",
        f: "600",
        g: "700",
        h: "800",
    }
    
    // rowData from record stream (CSV input step)
    var nodes = rowData.split(",");
    
    for (i = 0; i < nodes.length; i++){
        var nodeId = nodeIdsMap[nodes[i]];
        var parentNodeId = (i == 0) ? "0" : nodeIdsMap[nodes[i-1]];
        var level = i + 1;
        writeRow(nodeId, nodes[i], parentNodeId, level);
    }
    trans_Status = SKIP_TRANSFORMATION;
    

    Modified Java Script Value Field Settings:

    • Fieldname; Type; Replace value'Fieldname' or 'Rename to'
    • nodeId; String; N
    • nodeName; String; N
    • parent_id; String; N
    • nodeLevel; String; N