
Recommended way to access HBase using Scala


Now that SpyGlass is no longer being maintained, what is the recommended way to access HBase from Scala/Scalding? A similar question was asked in 2013, but most of the suggested links are either dead or point to defunct projects. The only link that seems useful is to Apache Flink. Is that considered the best option nowadays? Are people still recommending SpyGlass for new projects even though it isn't being maintained? Performance (massively parallel) and testability are priorities.


Solution

  • Depends on what you mean by "recommended", I guess.

    DIY

    Eel

    If you just want to access data on HBase from a Scala application, you may want to have a look at Eel, which includes libraries to interact with many storage formats and systems in the Big Data landscape and is natively written in Scala.

    You'll most likely be interested in the eel-hbase module, which as of a few releases ago includes an HBaseSource class (as well as an HBaseSink). It's actually so recent that I just noticed the README still mentions that HBase is not supported. There are no explicit examples for HBase yet, but sources and sinks work in similar ways across the library.
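
    Since there are no official examples yet, here is a rough sketch of what reading from HBase with Eel might look like, following the pattern of the library's other sources. Note that the package path, the HBaseSource constructor, and its parameter names below are assumptions made for illustration, not the actual API; check the eel-hbase sources for the real signatures.

    ```scala
    // Hypothetical sketch: Eel sources generally expose data as a DataStream.
    // The import path and HBaseSource parameters here are assumptions;
    // consult the eel-hbase module for the real API.
    import io.eels.component.hbase.HBaseSource

    val source = HBaseSource(namespace = "default", table = "my_table")

    // Pull the rows into memory and print them
    source.toDataStream().collect.foreach(println)
    ```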

    Kite

    Another alternative could be Kite, which also has a fairly extensive set of examples to draw inspiration from (including with HBase), but it looks like a less active project than Eel.


    Big Data frameworks

    If you'd rather have a framework that helps you than brew your own solution out of libraries, these are the main options. Of course, you'll have to account for some learning curve.

    Spark

    Spark is a fairly mature project, and the HBase project itself has built a connector for Spark 2.1.1 (Scaladocs here). Here is an introductory talk that can help you get started.

    The general idea is that you could use this custom data source as suggested in this example:

    import org.apache.spark.sql.execution.datasources.hbase.{HBaseRelation, HBaseTableCatalog}

    sqlContext
      .read
      .options(Map(HBaseTableCatalog.tableCatalog -> cat, HBaseRelation.HBASE_CONFIGFILE -> conf))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load()
    

    This gives you access to HBase data through the Spark SQL API. Here is a short extract from the same example:

    // Read two HBase tables as DataFrames, filter each on the row key,
    // and join them on the shared column
    val df1 = withCatalog(cat1, conf1)
    val df2 = withCatalog(cat2, conf2)
    val s1 = df1.filter($"col0" <= "row120" && $"col0" > "row090").select("col0", "col2")
    val s2 = df2.filter($"col0" <= "row150" && $"col0" > "row100").select("col0", "col5")
    val result = s1.join(s2, Seq("col0"))
    

    Performance considerations aside, as you can see, the language feels pretty natural for data manipulation.
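
    For context, the `cat`, `cat1`, and `cat2` values passed around in the snippets above are JSON catalog strings that tell the connector how to map an HBase table onto DataFrame columns. A minimal sketch (the table, namespace, and column family names below are made up for illustration) might look like this:

    ```scala
    // Hypothetical catalog: maps HBase table "table1" onto a DataFrame with
    // a string row key exposed as "col0" and one string column from family "cf1".
    val cat =
      s"""{
         |  "table": {"namespace": "default", "name": "table1"},
         |  "rowkey": "key",
         |  "columns": {
         |    "col0": {"cf": "rowkey", "col": "key", "type": "string"},
         |    "col2": {"cf": "cf1", "col": "col2", "type": "string"}
         |  }
         |}""".stripMargin
    ```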

    Flink

    Two answers already dealt with Flink, so I won't add much more, except for a link to an example from the latest stable release at the time of writing (1.4.2) that you may be interested in having a look at.
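
    As a small taste of that API, here is a sketch of reading from HBase with Flink's batch API via the flink-hbase connector's TableInputFormat; the table, column family, and column names are made up for illustration.

    ```scala
    import org.apache.flink.api.scala._
    import org.apache.flink.addons.hbase.TableInputFormat
    import org.apache.flink.api.java.tuple.Tuple2
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.util.Bytes

    // Sketch: read (rowKey, value) pairs from a (made-up) HBase table "my_table",
    // column family "cf", qualifier "col", using Flink's batch API.
    class MyTableInput extends TableInputFormat[Tuple2[String, String]] {
      override def getTableName: String = "my_table"

      override def getScanner: Scan =
        new Scan().addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"))

      override def mapResultToTuple(r: Result): Tuple2[String, String] =
        new Tuple2(
          Bytes.toString(r.getRow),
          Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))))
    }

    val env = ExecutionEnvironment.getExecutionEnvironment
    val rows = env.createInput(new MyTableInput)
    rows.print()
    ```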