scala hadoop cascading flume

What is the most mature library for building a Data Analytics Pipeline in Java/Scala for Hadoop?


I recently found many options and am interested in how they compare, primarily in terms of maturity and stability.

  1. Crunch - https://github.com/cloudera/crunch
  2. Scrunch - https://github.com/cloudera/crunch/tree/master/scrunch
  3. Cascading - http://www.cascading.org/
  4. Scalding - https://github.com/twitter/scalding
  5. FlumeJava
  6. Scoobi - https://github.com/NICTA/scoobi/

Solution

  • Scalding also has the advantage of significant open-source projects built atop it, such as the Matrix API and Algebird.

    Here are some examples: http://sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

    Cascalog was released almost two years before Scalding, and arguably has more advanced features for building robust workflows: https://github.com/nathanmarz/cascalog/wiki
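
To give a rough feel for the collection-style pipelines that the Scala libraries above aim for, here is a minimal word-count sketch. It uses only plain Scala collections, not Scalding, Scoobi, or Scrunch themselves; `WordCountSketch` and `wordCount` are hypothetical names chosen for illustration. Scalding's typed API offers similarly named operations (`flatMap`, `groupBy`), but this block is an analogy under that assumption, not the library's actual job code.

```scala
// Plain Scala collections only: no Hadoop, Cascading, or Scalding dependency.
// WordCountSketch and wordCount are illustrative names, not part of any library.
object WordCountSketch {
  // Split each line into lowercase words, group identical words, and count them.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+"))  // one record per word
      .filter(_.nonEmpty)                    // drop empty tokens from blank lines
      .groupBy(identity)                     // word -> all of its occurrences
      .map { case (word, occurrences) => (word, occurrences.size) }
}
```

In a real Hadoop job the same chain of steps would be expressed against the library's distributed collection type rather than `Seq`, with the grouping step becoming the shuffle between map and reduce phases.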