hadoop, teradata, informatica, informatica-powercenter, bigdata

What is the best way to ingest data from Teradata into Hadoop with Informatica?


What is the best way to ingest data from a Teradata database into Hadoop using parallel data movement?

  • If we create a job that simply opens one session to the Teradata database, it will take a long time to load a huge table.
  • If we create a set of sessions to load data in parallel, with a SELECT in each session, it will trigger a set of full table scans on Teradata to produce the data.

What is the recommended best practice for loading data in parallel streams while avoiding unnecessary workload on Teradata?


Solution

  • The most scalable way I have found to ingest data into Hadoop from Teradata is to use the Teradata Connector for Hadoop. It is included in both the Cloudera and Hortonworks distributions. I will show an example based on the Cloudera documentation, but the same works with Hortonworks as well:

    Informatica Big Data Edition uses a standard Sqoop invocation via the command line and submits a set of parameters to it. So the main question is which driver to use to make parallel connections between the two MPP systems (see the example Sqoop command at the end of this answer).

    Here is the link to the Cloudera documentation: Using the Cloudera Connector Powered by Teradata

    And here is a digest from that documentation (you will find that this connector supports different kinds of load balancing between connections):

    Cloudera Connector Powered by Teradata supports the following methods for importing data from Teradata to Hadoop:

    • split.by.amp
    • split.by.value
    • split.by.partition
    • split.by.hash

    split.by.amp Method

    This is the optimal method for retrieving data from Teradata. The connector creates one mapper per available Teradata AMP, and each mapper retrieves data from its AMP. As a result, no staging table is required. This method requires Teradata 14.10 or higher.
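
    For reference, here is a minimal sketch of the kind of Sqoop command that Informatica Big Data Edition ends up submitting when this connector is used with the split.by.amp input method. The host, database, credentials, table, and target directory below are placeholders, and the connection-manager class name and connector-specific argument names should be verified against the Cloudera connector documentation for your version:

        # Hypothetical example: import a Teradata table into HDFS using the
        # Cloudera Connector Powered by Teradata. All connection values are placeholders.
        sqoop import \
          --connection-manager com.cloudera.connector.teradata.TeradataManager \
          --connect jdbc:teradata://teradata-host/DATABASE=sales_db \
          --username etl_user \
          --password-file /user/etl/.teradata_password \
          --table ORDERS \
          --target-dir /data/staging/orders \
          -- \
          --input-method split.by.amp
        # With split.by.amp the connector creates one mapper per Teradata AMP,
        # so the degree of parallelism follows the AMP count of the system.

    Because parallelism follows the AMP count rather than a user-chosen split column and SELECT, this is what avoids the repeated full table scans described in the question.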