Search code examples
excelpentahoetlkettlespoon

ETL - How to add table titles as field values to every following row in Pentaho Spoon


I'm new to Pentaho Spoon (Kettle) and have the following task:

There is a excel table, it could look like this:

history class
name      first name    id
Doe       John          213
Muster    Max           118
biology class
name      first name    id
Parker    Peter         844

The problem is that there are actualy more tables inside one xls-file. In this example there is the table for students of the history class and for the ones of the biology class.

What I need as output: a table for all students with a new field for each student which says which class he was part of. Example:

name      first name   id        class
Doe       John         213       history class
Muster    Max          118       history class
Parker    Peter        884       biology class

(my examples are extremly simplified)

How can I transform the input table to the output table? Thank you for your help!


Solution

  • You can't normally parse excel with multiple tables in pentaho. I.e. best way is to have multiple sheets for different tables.

    If it's not possible, then there are at least two options:

    1. Find max number of cols and specify cols manually with string type (because content of one column may be different - numbers, strings, etc.) and then parse rows according to some rules
    2. Extract sheet content to text using POI - new XSSFExcelExtractor(new XSSFWorkbook(new File(fileName))).getText() - and then parse it using java.

    Here is option 1. for your simple case:

    1. Excel input fields (header is unchecked!):

    enter image description here

    1. JS step that adds className field that contains current class name:

    enter image description here

    Result after second step:

    enter image description here

    1. Final step - we filter out header rows (i.e. where first name is empty or = "first name".

    Result:

    enter image description here