Search code examples
pentahopentaho-cdepentaho-spoon

How to split a CSV file into groups using Pentaho?


I am new to Pentaho and am trying to read a CSV file (which I already did) and create blocks of data based on an identifier.

Eg

1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M
1|N|O|P
4|Q|R|S|T
5|U|V|W

I need to split and group this as such:
(each block starts when the first column is equal to '1')

Block a)

1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M

Block b)

1|N|O|P
4|Q|R|S|T
5|U|V|W

Eg

 a |1|A|B|C
 a |2|D|E|F
 a |8|G|H|I|J|K
 a |4|L|M

 b |1|N|O|P
 b |4|Q|R|S|T
 b |5|U|V|W

How can this be achieved using Penatho? Thanks.

I found a similar question but answers don't really help my case Pentaho Kettle split CSV into multiple records


Solution

  • I think I got the answer.

    I created the transformation in this zip that can transform your "csv" file in rows almost like you described but I don't know what you intend to do next, so maybe you can give us more details. =)

    I'll explain what I did:

    1) First, we grab the row full text with a Text input step

    Transformation

    When you look at configurations of Text Input step, you'll see I used a ';' has separator, when your input file uses '|' so I'm not spliting columns with the '|' but loading the whole line in one column. Grabbing the row's full text, nothing else.

    2) Next we apply a regex eval to separate the ID from the rest of our string.

    ^(\d+)\|(.*)
    

    Which means: in the beginning of the text I expect one or more digits followed by a pipe and anything after that. Capture the digits in the beginning of the string in one column and everything after the pipe to another column.

    That gives you this output: (blue is the first capture group, red is the second) regex

    3) Now what you need is to add a 'sequence' that only goes up if there is a row_id = 1. Which I did in the Mod JS Value with the following code:

    var sequence
    
    //if it's the first row, set sequence to 1
    if(sequence == null){
        sequence = 1;
    }else{
    //if it's not the first row, check if the row_id is equal to 1 (string)
        if(row_id == '1'){
            // increment the sequence
            sequence++;
        }else{
            //nothing
        }
    }
    

    And that will give you this output that seem to be what you expected: (green, the group/sequence done)

    sequence

    Hope it helps =)