Search code examples
etltalenddatabase-normalization

Talend Normalize Flat File into Relational Database Tables


We have a single source table, which is flat — we need to insert different fields from a given record into multiple tables. We are successfully using lastInsertID a single time, but we are struggling with how to re-add fields from the same source row again in subsequent related tables.

For example, if we had a mailing address (goofy example coming up, but good for common discussion)

-----------Source----------
First Name
Middle Name
Last Name
Address 1
Address 2
City
State
Zip

-----------Targets-------------

People
 address_id
 First Name
 Last Name

Address
 address_id
 state_id
 zip_id
 Address 1
 Address 2

States
 state_id
 State Name

Zip
 zip_id
 Zip Code

Furthermore, we cannot be sure, that we may not need to add the same column to more than one table.

What is the best practice for this kind of data normalization in Talend?


Solution

  • I would approach this iteratively, normalising part of the table with each step.

    You should be able to normalise the person data away from the address, state and zip data in one step and then normalise the state away from the address and zip data and then finally normalise the zip data away from the rest of the address.

    As an example, and following on from the example in your question here's a few jobs that will do exactly that:

    To start with we should create the example data. I'm going to use MySQL for this example but the same principles apply for any of the major RDBMS'.

    Let's create an empty table to start with:

    DROP DATABASE IF EXISTS normalisation;
    
    CREATE DATABASE IF NOT EXISTS normalisation
    CHARACTER SET utf8 COLLATE utf8_unicode_ci;
    
    CREATE TABLE IF NOT EXISTS normalisation.denormalised (
        FirstName VARCHAR(255),
        MiddleName VARCHAR(255),
        LastName VARCHAR(255),
        Address1 VARCHAR(255),
        Address2 VARCHAR(255),
        City VARCHAR(255),
        State VARCHAR(255),
        Zip VARCHAR(255)
    ) ENGINE = INNODB;
    

    Into this we need to populate it with some example data which can be done easily enough with Talend's tRowGenerator component:

    Row generating job layout

    I've configured the tRowGenerator to give us some semi sensible testing output:

    tRowGenerator component

    I've also added an extra step to add some co-habitors to ~1/3 of addresses using the following tMap configuration:

    tMap configuration to add co-habitors to some addresses

    Now that we have our test data easily generated we can move on to actually normalising the data from this denormalised table.

    As mentioned above, our first step is to normalise the person data out. We start by creating the necessary tables for the person data and the remaining address data:

    CREATE TABLE IF NOT EXISTS normalisation.person (
        Person_id BIGINT AUTO_INCREMENT PRIMARY KEY,
        FirstName VARCHAR(255),
        MiddleName VARCHAR(255),
        LastName VARCHAR(255),
        Address_id BIGINT
    ) ENGINE = INNODB;
    
    CREATE TABLE IF NOT EXISTS normalisation.addressStateZip (
        Address_id BIGINT AUTO_INCREMENT PRIMARY KEY,
        Address1 VARCHAR(50),
        Address2 VARCHAR(50),
        City VARCHAR(50),
        State VARCHAR(50),
        Zip VARCHAR(50),
        UNIQUE KEY addressStateZip (Address1, Address2, City, State, Zip)
    ) ENGINE = INNODB;
    

    We then populate these 2 tables by getting all of the address type data, taking only the unique rows and then putting this into the addressStateZip staging table:

    Normalise people data job layout

    The second part of the above job then compares the addressStateZip data to the initial denormalised table and collecting the joins to get the Address_id for the person table:

    tMap configuration to get Address_id for person table

    The remaining steps are now quite similar.

    Next we create the state table and another staging table for the address and zip data:

    CREATE TABLE IF NOT EXISTS normalisation.state (
        State_id BIGINT AUTO_INCREMENT PRIMARY KEY,
        State VARCHAR(255),
        UNIQUE KEY state (State)
    ) ENGINE = INNODB;
    
    CREATE TABLE IF NOT EXISTS normalisation.addressZip (
        Address_id BIGINT AUTO_INCREMENT PRIMARY KEY,
        Address1 VARCHAR(50),
        Address2 VARCHAR(50),
        City VARCHAR(50),
        State_id BIGINT,
        Zip VARCHAR(50),
        UNIQUE KEY addressStateZip (Address1, Address2, City, State_id, Zip)
    ) ENGINE = INNODB;
    

    Now we need to take the unique states from the addressStateZip table and put these into the state table:

    Normalise state data job layout

    And the second part, as before, then creates the data into the addressZip staging table with the State_id instead of the actual state:

    tMap configuration to get State_id for addressZip table

    Now, finally, we can create our zip table and then link that to a proper address table:

    CREATE TABLE IF NOT EXISTS normalisation.zip (
        Zip_id BIGINT AUTO_INCREMENT PRIMARY KEY,
        ZIP VARCHAR(255),
        UNIQUE KEY zip (ZIP)
    ) ENGINE = INNODB;
    
    CREATE TABLE IF NOT EXISTS normalisation.address (
        Address_id BIGINT AUTO_INCREMENT PRIMARY KEY,
        Address1 VARCHAR(50),
        Address2 VARCHAR(50),
        City VARCHAR(50),
        State_id BIGINT,
        Zip_id BIGINT,
        UNIQUE KEY addressStateZip (Address1, Address2, City, State_id, Zip_id)
    ) ENGINE = INNODB;
    

    Using the same methodology as with the state data we get all of the unique zips and put these into the zip table:

    Normalise zip data job layout

    And, as before, we can now put the Zip_id into a new, finished address table:

    tMap configuration to get Zip_id for address table

    And to check things we can now run the following query to get all the data back out:

    SELECT p.FirstName, p.MiddleName, p.LastName, a.Address1, a.Address2, a.City, s.State, z.Zip
    FROM normalisation.person AS p
    INNER JOIN normalisation.address AS a ON a.Address_id = p.Address_id
    INNER JOIN normalisation.state AS s ON s.State_id = a.State_id
    INNER JOIN normalisation.zip AS z ON z.Zip_id = a.Zip_id;
    

    You'll probably also want to add some foreign key constraints to the tables now that you're done setting things up:

    ALTER TABLE normalisation.person
    ADD FOREIGN KEY (Address_id) REFERENCES address(Address_id);
    
    ALTER TABLE normalisation.address
    ADD FOREIGN KEY (State_id) REFERENCES state(State_id),
    ADD FOREIGN KEY (Zip_id) REFERENCES zip(Zip_id);