Search code examples
talend

Catch data errors in Talend


I have some data that occasionally has errors that need to be checked/validated rather than be loaded to my target system.

How can I do this? For example:

Input:

Local number
------------
789-523

Output:

LocalNumber
------------
789-523

This kind of value needs validation. There should be another number at the last. How I can I get this kind value in my input file and put it on a separate file for validation?


Solution

  • At a high level you'll need to define what you consider data errors using some components and use those to filter the data.

    As an example you might have the following data:

    .--+-----+-----------+--------------------+--------+-----.
    |id|name |phone      |address             |city    |state|
    |=-+-----+-----------+--------------------+--------+----=|
    |1 |Bob  |02071234568|165 Lake Tahoe Blvd.|Richmond|MN   |
    |2 |Susan|02071234567|345 E Fowler Avenue |Helena  |CA   |
    |3 |Jimmy|foobar     |222 Tully Road East |Bismarck|MA   |
    |4 |Janet|07811111111|230 Camelback Rd    |Boise   |GB   |
    '--+-----+-----------+--------------------+--------+-----'
    

    I'm using British phone numbers here for US address' and states but that's because I can't think of any useful US phone numbers :)

    Here we want to check whether the phone number is valid and whether and the state is valid too. For now we're just going to print the results of everything to the console using a tLogRow but equally this could be any kind of output including log files, databases or even the Talend Data Stewardship Console. A quick job might look like this:

    Data validation job layout

    To check whether the phone number is valid (and also optionally standardise it to a predefined format) we can use the tStandardizePhoneNumber component:

    tStandardizePhoneNumber component

    This then adds some columns to your schema including whether the phone number is valid and also a standardised output.

    We then use a tMap to filter on whether the phone number is valid and at the same time replace the number with the provided standardised phone number (in this case an international formatted phone number):

    tMap to filter valid and invalid phone numbers

    After this we can use a lookup to a list of valid US states and inner join this in a tMap to check if the state is valid. We also use this opportunity to get the full state name:

    tMap to filter valid and invalid US states

    This general principle applies to how you apply any data validation: use a component or some logic (either in a tMap or something like a tPatternCheck) to determine if the data is valid and then use a filtering component (the tPatternCheck is already a filtering component) to direct your output.

    If you're looking to validate more basic things such as the metadata of the data e.g., the length of the column or the data type then you can use a tSchemaComplianceCheck to filter rows of data that don't match the predefined schema.