Search code examples
talenddata-integration

Using Talend Open Studio DI to extract extract value from unique 1st row before continuing to process columns


I have a number of excel files where there is a line of text (and blank row) above the header row for the table.

What would be the best way to process the file so I can extract the text from that row AND include it as a column when appending multiple files? Is it possible without having to process each file twice?

Example

This file was created on machine A on 01/02/2013

Task|Quantity|ErrorRate

0102|4550|6 per minute
0103|4004|5 per minute

And end up with the data from multiple similar files

Task|Quantity|ErrorRate|Machine|Date
0102|4550|6 per minute|machine A|01/02/2013
0103|4004|5 per minute|machine A|01/02/2013
0467|1264|2 per minute|machine D|02/02/2013

Solution

  • I put together a small, crude sample of how it can be done. I call it crude because a. it is not dynamic, you can add more files to process but you need to know how many files in advance of building your job, and b. it shows the basic concept, but would require more work to suite your needs. For example, in my test files I simply have "MachineA" or "MachineB" in the first line. You will need to parse that data out to obtain the machine name and the date.

    But here is how may sample works. Each Excel is setup as two inputs. For the header the tFileInput_Excel is configured to read only the first line while the body tFileInput_Excel is configured to start reading at line 4.

    In the tMap they are combined (not joined) into the output schema. This is done for the Machine A Excel and Machine B excels, then those tMaps are combined with a tUnite for the final output.

    As you can see in the log row the data is combined and includes the header info.

    enter image description here