Search code examples
mysqlmongodbjoinanalyticshbase

JOINs across MySQL and MongoDB/HBase for analytics data


We are looking to build an internal real-time analytics system using MongoDB or HBase (other options possible, but these are the two we are looking at right now.) We picked these two because:

  • We will be recording a huge number of events, and want to be able to horizontally and vertically scale.
  • We will get "long tail" requests - for example, "give me the metrics for this item three years ago" - and hence need to do real-time queries over a large range, whether time or other values.

We have tried to use MySQL for the analytics tables, but the tables get large really fast and we don't want our analytics tables to be a "dead weight" on the rest of our MySQL system. (I am open to advice on using MySQL or other solutions for analytics tables if anyone has insight.)

So, the question is: we will often have to do real-time lookups and JOINs across both our analytics data store and MongoDB/Hbase. For instance, we will be storing the "id" of a table in analytics, but for any request, we will have to "join" it against, say, the "title" and "content" fields of a "content" table in MySQL.

Is there any way to do JOINs across MongoDB/HBase and MySQL, or is this something we would have to custom-build into our application? And if we build it in the application, how would we deal with custom JOINs on large data sets across MySQL and MongoDB/HBase?


Solution

  • Is there any way to do JOINs across MongoDB/HBase and MySQL...

    MongoDB has absolutely zero support for JOINs. You would basically have to build your own join system if you use MongoDB

    HBase is build over top of Hadoop and, in theory, some JOIN-like syntax is possible.

    However, joining between MySQL and HBase is almost certainly custom code. You may want to consider copying the MySQL data into HBase.

    ... need to do real-time queries over a large range... "give me the metrics for this item three years ago"

    Recognize that this is going to be a tremendous challenge and that you are unlikely to find anything "out of the box" that does this for you.

    You will need to be very judicious about the term "real-time". It sounds like you want "relatively responsive queries" rather than "real-time queries". You will have to scope "relatively responsive" as this will massively affect both the cost and the products you use.