hadoop hbase lambda-architecture bigdata

HBase for a real-time application

I want to build a real-time application for predictive maintenance. I thought about using HBase with Phoenix, which provides a SQL layer on top of HBase.

I read that HBase is good for Big Data, for example 100 million rows or more. But my application has no data at the moment. How will HBase behave if there is only a small amount of data at the beginning? And is HBase a good solution for a real-time web application?

I want to have a lambda-architecture-like system for batch and stream processing. Would HBase on top of HDFS serve as both my OLTP and my OLAP system?

The lambda architecture has a batch layer and a speed layer. Can I also use the HBase data in HDFS for batch processing and save the results back to HBase?

In general, I want to know whether HBase is a good solution for building a real-time web application that also makes analytics possible.

Solution

In general, HBase is chosen based on the following:

Volume: HBase suits row counts in the millions to billions better than row counts in the thousands to millions.

Features: choose it when you do not need transactions, secondary indexes, and similar RDBMS features.

Hardware: make sure you have sufficient hardware for the region servers. Running them involves a good amount of maintenance.

More specifically:

It is well suited to web applications because of its fast random reads. But this only comes with very good row key design: plan your end queries well in advance and design the row key around them. Take special care with row key design if your data is time-based and your queries depend heavily on time. In short, you should avoid hot spotting. Some info here
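To make the row key advice concrete, here is a minimal sketch of one common way to avoid hot spotting with time-based data: prefix the key with a small salt derived from the entity id, and store the timestamp reversed so that the newest reading sorts first. The machine id format, the bucket count and the function name are assumptions made for illustration, and the code only builds the key bytes; it does not talk to HBase.

```python
import hashlib

NUM_BUCKETS = 16        # hypothetical; pick a value that fits your cluster size
MAX_TS_MS = 2**63 - 1   # largest signed 64-bit value, used to reverse timestamps


def salted_row_key(machine_id: str, ts_ms: int) -> bytes:
    """Build a key laid out as: salt byte, machine id, 0x00, reversed timestamp."""
    id_bytes = machine_id.encode("utf-8")
    # Derive the salt from the machine id (not from the timestamp), so all rows
    # of one machine share a bucket and can still be read with one range scan,
    # while different machines spread across buckets.
    salt = hashlib.sha256(id_bytes).digest()[0] % NUM_BUCKETS
    # Reversed timestamp: a newer reading gets a smaller value and sorts first.
    reversed_ts = MAX_TS_MS - ts_ms
    return bytes([salt]) + id_bytes + b"\x00" + reversed_ts.to_bytes(8, "big")
```

With this layout, "the latest readings for machine X" becomes a short scan starting at the machine's prefix, and writes arriving at the same moment from many machines do not all land in one region. If your main query is instead "all machines in a time window", a different key layout would be the better fit, which is why the queries have to be known first.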

Apart from this, you can select by other column values using HBase filters, but this is practical only for a few selections, and filters may not meet a web app's response-time requirements.
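The difference between a row key lookup and a column filter can be shown with a toy model. The sketch below is not the HBase API: it imitates a table as a list sorted by row key, and the key format, column name and function names are made up for illustration. It compares how many rows each access path has to examine.

```python
import bisect


def prefix_stop_row(prefix: bytes) -> bytes:
    """Return the smallest key greater than every key that starts with prefix."""
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""  # no upper bound: scan to the end of the table
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


# Toy table: 100 machines x 50 readings, kept sorted by row key.
ROWS = sorted(
    (f"m{m:03d}|{t:04d}".encode(), {"status": "fail" if t % 10 == 0 else "ok"})
    for m in range(100)
    for t in range(50)
)
KEYS = [key for key, _ in ROWS]


def range_scan(prefix: bytes):
    """Row-key prefix scan: binary-search the bounds, read only matching rows."""
    start = bisect.bisect_left(KEYS, prefix)
    stop_key = prefix_stop_row(prefix)
    stop = bisect.bisect_left(KEYS, stop_key) if stop_key else len(KEYS)
    return ROWS[start:stop], stop - start  # (results, rows examined)


def filter_scan(column: str, value: str):
    """Column-value filter: every row has to be examined."""
    hits = [(key, cols) for key, cols in ROWS if cols.get(column) == value]
    return hits, len(ROWS)  # (results, rows examined)
```

In this model, `range_scan(b"m042|")` examines only the 50 rows of that machine, while `filter_scan("status", "fail")` walks all 5000 rows to find its matches. The cost of a filter grows with the size of the scanned range in the same way, which is why filters on their own are risky for interactive response times.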

Also, if your rows have a variable number of columns and your queries do not need all of them, HBase is again a good fit.

Region server failover is possible in HBase, so your data would be safe.

It can be used for both batch and streaming. Of course, for streaming it is among the best options in the Big Data stack. However, this also depends on your streaming pipeline, for example Kafka, Spark Streaming or Storm.

Since you mentioned Phoenix, I assume you want to stick to a SQL view of HBase, which might give you better options. However, row key design is still at the heart of HBase performance.