etl talend

Convert semi-structured data to structured data using Talend Big Data


Employee
 Employee Type                          : 0130
 Unit                                   : 4189670095711234
 Basic Salary                           : 11.00
 Joined Date                            : 04/12/yy 06:30:05
 Country                                : 826-United Kingdom

(123.66)                      --- Endof Employee -------------

R 4567 ABCD             -> Len f---- i 01/14

Employee
 Employee Type                          : 0120
 Unit                                   : 4189670095711234
 Basic Salary                           : 11.00
 Joined Date                            : 04/12/yy 06:30:05
 Country                                : 826-United Kingdom

(123.66)-                      --- Endof Employee ------------

R 4567 ABCD             -> Len f---- i 01/14

Employee
 Employee Type                          : 0130
 Unit                                   : 4189670095711235
 Basic Salary                           : 11.00
 Joined Date                            : 04/12/yy 06:30:05
 Country                                : 826-United Kingdom

(123.66)                      --- Endof Employee -------------

Hi,

I would like to convert the semi-structured data above to structured data using Talend.

Please let me know how I can convert the data to a structured form so that I can insert it into a relational table.


Solution


  • Here is a solution based on the tPivotToColumnsDelimited component.

    tFileInputDelimited is associated with a two-field schema (the fields are named property and value) and uses a special field separator, " : " (space-colon-space), so the colons inside a time value such as 06:30:05 are not treated as separators.
    In the Advanced settings, the options "Trim all columns" and "Check each row structure against schema" are ticked.

    tMap assigns a rank to each input line depending on its "property" name. The sequence name is based on the property name, so all file records for the same employee get the same rank value.

    Finally, tPivotToColumnsDelimited moves all input records with the same rank value onto a single line and, most importantly, associates each value with the right property. Set "Pivot column" to "property", "Aggregation column" to "value", "Aggregation function" to "first" and "Group by" to "rank". Select the desired filename for the output and you will get the desired result.

    Hope this helps.
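
For readers who want to check the logic outside Talend, the three steps above (split on " : ", rank per property, pivot by rank) can be sketched in plain Python. This is only an illustration of the idea, not the Talend job itself: the function names `parse_rows`, `add_rank` and `pivot` are hypothetical, and the sketch assumes every employee block contains each property exactly once and in file order.

```python
from collections import defaultdict

# Two employee blocks copied from the question's sample data.
SAMPLE = """\
Employee
 Employee Type                          : 0130
 Unit                                   : 4189670095711234
 Basic Salary                           : 11.00
 Joined Date                            : 04/12/yy 06:30:05
 Country                                : 826-United Kingdom

(123.66)                      --- Endof Employee -------------

R 4567 ABCD             -> Len f---- i 01/14

Employee
 Employee Type                          : 0120
 Unit                                   : 4189670095711234
 Basic Salary                           : 11.00
 Joined Date                            : 04/12/yy 06:30:05
 Country                                : 826-United Kingdom

(123.66)-                      --- Endof Employee ------------
"""


def parse_rows(text):
    """Step 1: split on ' : ', trim both fields, drop rows without exactly 2 fields."""
    rows = []
    for line in text.splitlines():
        fields = line.split(" : ")
        if len(fields) == 2:  # mimics rejecting rows that do not match the schema
            rows.append((fields[0].strip(), fields[1].strip()))
    return rows


def add_rank(rows):
    """Step 2: keep one counter per property name; the n-th occurrence gets rank n."""
    counters = defaultdict(int)
    ranked = []
    for prop, value in rows:
        counters[prop] += 1
        ranked.append((counters[prop], prop, value))
    return ranked


def pivot(ranked):
    """Step 3: group by rank, turn properties into columns, keep the first value."""
    records = {}
    for rank, prop, value in ranked:
        records.setdefault(rank, {}).setdefault(prop, value)
    return [records[rank] for rank in sorted(records)]


records = pivot(add_rank(parse_rows(SAMPLE)))
for record in records:
    print(record)
```

Each resulting dictionary corresponds to one row of a relational table, with the property names as column names. Note that if a property were missing from one employee block, the per-property counters would drift apart and later values would land in the wrong row, so this ranking scheme is only safe when the blocks are complete.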