Search code examples
hadooppentahokettlepdi

Pentaho and Hadoop


I am sorry if this question seems naive, But I am new to Data engineering field, as I am self learner right now, however my questions is what is the differences between ETL products like Pentaho and Hadoop? when I use this instead of that? or I may use them together, how?

Thank you,


Solution

  • An ETL is a tool to Extract data, Transform (join, enrich, filter,...) it and Load the result in another data store. Good ETLS are visual, data store agnostic and easy to automate.

    Hadoop is a data store distributed on a network of clusters plus software to handle diseminated data. The data transformation is specialized on few elementary operations which can be optimized to this usually massive amount of data, like (but not only) Map-Reduce.

    Pentaho Data Integrator has connectors to Hadoop systems which are easy to set up and tune up. So the best strategy is to setup a Hadoop network as data store and manipulate it through the PDI.