Tags: hadoop, hive, sqoop, informatica-powercenter

Sqoop vs Informatica Big Data Edition for data sourcing


I have the option of using either Sqoop or Informatica Big Data Edition to source data into HDFS. The source systems are Teradata and Oracle.

I would like to know which one is better, and the reasons behind it.

Note: my current utility is able to pull data into HDFS using Sqoop, create a Hive staging table, and archive into an external table.
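
For reference, the utility does roughly the following (simplified; the connection string, table, columns, and paths below are placeholders):

    # 1. Pull the source table into HDFS with Sqoop.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username etl_user \
      --password-file /user/etl/.ora_pass \
      --table SALES.ORDERS \
      --target-dir /data/staging/orders \
      --fields-terminated-by ',' \
      --num-mappers 4

    # 2. Create a Hive staging table over the imported files.
    hive -e "
      CREATE EXTERNAL TABLE IF NOT EXISTS stg_orders (
        order_id BIGINT, customer_id BIGINT, amount DECIMAL(10,2), order_ts STRING
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/data/staging/orders';"

    # 3. Archive into an external table kept for history.
    hive -e "
      CREATE EXTERNAL TABLE IF NOT EXISTS archive_orders LIKE stg_orders
      LOCATION '/data/archive/orders';
      INSERT INTO TABLE archive_orders SELECT * FROM stg_orders;"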

Informatica is the ETL tool used in the organization.

Regards, Sanjeeb


Solution

  • Sqoop

    • Sqoop can perform both full and incremental loads from Oracle/Teradata.
    • Sqoop copies data from source systems in parallel (a minimal import example is sketched after this comparison).
    • Sqoop scripts can be custom generated and scheduled by Oozie.
    • Open-source solution for any cluster size; no license cost.

  • Informatica

    • Best interface in the ETL industry for managing mappings.
    • Does not provide a parallel copy option. Instead it offers a Hive mode for parallel processing, which converts transformations into Hive queries for execution; it also supports pushdown to generate MapReduce code.
    • Licensing cost is per node: if you plan 500 Hadoop nodes for future data storage, you pay roughly 10 times as much as for a 50-node cluster when you scale.
    • Informatica BDE is a relatively new product in the market. INFA Developer will be useful for working with Big Data, but there are challenges in supporting the latest Hadoop platform features in Informatica, as well as traditional RDBMS-style features such as sequence generation, stateful mappings, sessions, and the Lookup transformation in Informatica BDE.
    • Informatica MDM does not support Hadoop.
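
As an illustration of the parallel copy and incremental options mentioned above, a Sqoop import from Oracle might look roughly like this (connection string, table, and key column are placeholders; the same pattern applies to Teradata with its connector):

    # Incremental, parallel import from Oracle into HDFS.
    #   --num-mappers 8      : eight parallel map tasks each copy a slice of the table
    #   --split-by ORDER_ID  : column used to split the table across mappers
    #   --incremental append : only pull rows with ORDER_ID greater than --last-value
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username etl_user \
      --password-file /user/etl/.ora_pass \
      --table SALES.ORDERS \
      --target-dir /data/raw/orders \
      --split-by ORDER_ID \
      --num-mappers 8 \
      --incremental append \
      --check-column ORDER_ID \
      --last-value 1000000

Wrapping the import in a Sqoop saved job (sqoop job --create ...) lets Sqoop track the last-value high-water mark between runs, and the job can then be scheduled from Oozie.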

If price is the deciding criterion, go for Sqoop. If you want the flexibility to switch between Hadoop platform tools, use Sqoop (the Sqoop project is also considering a move to Spark). If you are tied to Informatica for some reason, go for Informatica. But most Informatica developers want to move to Hadoop technologies.