mysql, apache-spark, hbase, sqoop

Near real-time sync from MySQL to HBase


I am currently having trouble syncing data from MySQL to HBase. I need near real-time sync, and I need to merge multiple MySQL tables into one HBase table as part of the sync.

I tried Sqoop, but it does not seem to fit our requirements.

Are there any existing tools or libraries I can use for this case, or any other solutions I could try with Spark?


Solution

  • Consider using Apache Phoenix on HBase. It gives you low-latency SQL queries on data stored in HBase (so it is suitable for OLTP and easy to use for OLAP). Because the data lives in HBase from the start, you don't have to worry about syncing. It also has NoSQL features, such as the ability to add columns dynamically at query time.

    To satisfy your use case, you could run Phoenix for OLTP, and a second instance of Phoenix on a read replica to run table joins for OLAP.

    http://www.cloudera.com/documentation/enterprise/5-4-x/topics/admin_hbase_read_replicas.html

    Secondary replicas are refreshed at intervals controlled by a timer (hbase.regionserver.storefile.refresh.period), and so are guaranteed to be at most that interval of milliseconds behind the primary RegionServer.

    This solution satisfies your requirements for OLTP, OLAP, and near real-time syncing, while giving your transactional database scalability that you would not easily have with MySQL. Apache Phoenix also integrates with the rest of the Hadoop ecosystem, so it should fit well with your current analytics stack.
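
To make the dynamic-column feature mentioned above more concrete, here is a minimal sketch in Python that only builds the SQL text. Phoenix lets a statement declare extra columns (name plus type) inline, either in the UPSERT column list or after the table name in a SELECT. The table and column names (EVENT_LOG, EVENT_ID, USED_MEMORY) and the two helper functions are hypothetical, chosen for illustration only.

```python
# Sketch only: EVENT_LOG, EVENT_ID and USED_MEMORY are hypothetical names,
# and these helpers are not part of Phoenix or any driver.

def dynamic_upsert(table, fixed, dynamic):
    """Build a Phoenix UPSERT that declares dynamic columns inline.

    fixed:   {column_name: value} for columns defined in the table schema
    dynamic: {column_name: (sql_type, value)} for columns not in the schema
    Returns (sql, params) with '?' placeholders.
    """
    # Dynamic columns carry their type next to the name in the column list.
    cols = list(fixed) + [f"{name} {sql_type}"
                          for name, (sql_type, _) in dynamic.items()]
    params = list(fixed.values()) + [value for _, value in dynamic.values()]
    placeholders = ", ".join("?" for _ in params)
    sql = f"UPSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
    return sql, params


def dynamic_select(table, select_cols, dynamic_types, where):
    """Build a SELECT that declares dynamic columns after the table name."""
    decl = ", ".join(f"{name} {sql_type}"
                     for name, sql_type in dynamic_types.items())
    return f"SELECT {', '.join(select_cols)} FROM {table}({decl}) WHERE {where}"


if __name__ == "__main__":
    sql, params = dynamic_upsert(
        "EVENT_LOG", {"EVENT_ID": 1}, {"USED_MEMORY": ("BIGINT", 512)}
    )
    print(sql, params)
    print(dynamic_select(
        "EVENT_LOG", ["EVENT_ID", "USED_MEMORY"],
        {"USED_MEMORY": "BIGINT"}, "EVENT_ID = ?",
    ))
```

Assuming a DB-API style driver for Phoenix that accepts `?` placeholders, the returned SQL and parameter list could be passed to `cursor.execute(sql, params)`. Whether dynamic columns are a good fit when merging several MySQL tables into one HBase table depends on how stable the merged schema is.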