Tags: hbase, bigdata, streaming, spark-streaming, apache-kafka-streams

Analysis of real-time streaming data


This is a relatively broad question, and I am aware of the tools I would likely need for a problem like this (e.g. Spark, Kafka and Hadoop), but I am looking for a concrete vision from an experienced professional's perspective.

Here's what the problem at hand looks like:

We are using a Google Analytics-like service, which sends us a stream of events. An event is an action performed on the page: a click on a button, a mouse movement, a page scroll, or a custom event defined by us. Here is a sample event:

{
  "query_params": [],
  "device_type": "Desktop",
  "browser_string": "Chrome 47.0.2526",
  "ip": "62.82.34.0",
  "screen_colors": "24",
  "os": "Mac OS X",
  "browser_version": "47.0.2526",
  "session": 1,
  "country_code": "ES",
  "document_encoding": "UTF-8",
  "city": "Palma De Mallorca",
  "tz": "Europe/Madrid",
  "uuid": "A37F2D3A4B99FF003132D662EFEEAFCA",
  "combination_goals_facet_term": "c2_g1",
  "ts": 1452015428,
  "hour_of_day": 17,
  "os_version": "10.11.2",
  "experiment": 465,
  "user_time": "2016-01-05T17:37:10.675000",
  "direct_traffic": false,
  "combination": "2",
  "search_traffic": false,
  "returning_visitor": false,
  "hit_time": "2016-01-05T17:37:08",
  "user_language": "es",
  "device": "Other",
  "active_goals": [1],
  "account": 196,
  "url": "http://someurl.com",
  "action": "click",
  "country": "Spain",
  "region": "Islas Baleares",
  "day_of_week": "Tuesday",
  "converted_goals": [],
  "social_traffic": false,
  "converted_goals_info": [],
  "referrer": "http://www.google.com",
  "browser": "Chrome",
  "ua": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
  "email_traffic": false
}

Now we need to build a solution to analyse this data: a reporting platform that can aggregate, filter, and slice and dice the events.

One example of the report we need to build is:

Show me all the users who are coming from the United States, are using the Chrome browser, and are using it on an iPhone.

or

Show me the sum of clicks on a particular button for all the users who are coming from referrer = "http://www.google.com", are based out of India, and are using a Desktop.

This service sends out millions of such events in a day, amounting to gigabytes of data per day.

Here are the specific doubts I have:

  • How should we store this huge amount of data?
  • How should we enable ourselves to analyse the data in real time?
  • How should the query system work here? (I am relatively clueless about this part.)
  • If we are looking at maintaining about 4 TB of data, which we estimate to accumulate over 3 months, what should the strategy be to retain this data? When and how should we delete it?

Solution

    1. How should we store this huge amount of data?

    Use one of the cloud storage providers (link). Partition the data by date and hour (date=2018-11-25/hour=16); this will reduce the amount of data being read per query. Store the data in one of the binary columnar formats such as Parquet or ORC, which will give you better performance and a better compression ratio.
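    As a minimal sketch of this layout with Spark (the bucket paths are illustrative; the ts epoch-seconds field comes from the sample event):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("events-writer").getOrCreate()

    # Hypothetical location of the raw JSON events.
    events = spark.read.json("s3a://my-bucket/raw-events/")

    # Derive date/hour partition columns from the epoch-seconds `ts` field.
    ts = F.from_unixtime("ts").cast("timestamp")
    partitioned = (events
        .withColumn("date", F.to_date(ts))
        .withColumn("hour", F.hour(ts)))

    # Parquet partitioned by date and hour lets queries prune down to
    # only the partitions they actually need.
    (partitioned.write
        .mode("append")
        .partitionBy("date", "hour")
        .parquet("s3a://my-bucket/events/"))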

    2. How should we enable ourselves to analyse the data in real time?

    You can run multiple applications listening on a Kafka topic. First, persist the events to storage using a Spark Structured Streaming 2.3 application (link); continuous mode is available from 2.3 onwards. This will give you the option to query and analyse historical data and to re-process events if required. You have two options here (a sketch of the first follows the two options):

    Option 1: Store in HDFS/S3/GCS etc. and build a Hive catalog on the stored data to get a live view of the events. You can use Spark/Hive/Presto to query the data. Note: compaction will be required if small files are being generated.

    Option 2: Store in a wide-column store like Cassandra or HBase (link). I would prefer this option for this use case.
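    Here is a minimal sketch of option 1, assuming a Kafka topic named "events" and illustrative broker, path and checkpoint settings; the schema covers only a few of the fields from the sample event. Note that the Parquet file sink runs in micro-batch mode, since continuous mode in Spark 2.3 supports only a limited set of sinks.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, LongType)

    spark = SparkSession.builder.appName("events-ingest").getOrCreate()

    # Subset of the event schema; extend with the remaining fields as needed.
    schema = StructType([
        StructField("uuid", StringType()),
        StructField("ts", LongType()),
        StructField("country", StringType()),
        StructField("browser", StringType()),
        StructField("device_type", StringType()),
        StructField("action", StringType()),
        StructField("referrer", StringType()),
    ])

    raw = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical brokers
        .option("subscribe", "events")
        .load())

    # Kafka delivers bytes; parse the JSON payload into columns and
    # derive the date/hour partition columns as in answer 1.
    ts = F.from_unixtime("ts").cast("timestamp")
    events = (raw
        .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.*")
        .withColumn("date", F.to_date(ts))
        .withColumn("hour", F.hour(ts)))

    # Append each micro-batch to the partitioned Parquet layout.
    query = (events.writeStream
        .format("parquet")
        .option("path", "s3a://my-bucket/events/")
        .option("checkpointLocation", "s3a://my-bucket/checkpoints/ingest/")
        .partitionBy("date", "hour")
        .outputMode("append")
        .start())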

    Run another Spark application in parallel for real-time analysis. If you know the dimensions and metrics on which you have to aggregate the data, use Spark Structured Streaming with windowing: group by the relevant columns over a 1-minute or 5-minute window and store the results in one of the above-mentioned stores, where they can be queried in near real time (link). A sketch of such a pre-aggregation follows.
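    A minimal sketch of the windowed aggregation, reusing the parsed streaming events DataFrame from the previous sketch (the dimensions, the 5-minute window and the watermark are illustrative choices):

    from pyspark.sql import functions as F

    # Event time from the epoch-seconds `ts` field; the watermark bounds
    # how late events may arrive before a window is finalised.
    with_time = events.withColumn(
        "event_time", F.from_unixtime("ts").cast("timestamp"))

    clicks_per_window = (with_time
        .withWatermark("event_time", "10 minutes")
        .groupBy(
            F.window("event_time", "5 minutes"),
            "country", "browser", "device_type", "action")
        .agg(F.count("*").alias("hits")))

    # Write the rolling aggregates somewhere queryable in near real time;
    # a foreachBatch sink could target Cassandra or HBase instead.
    agg_query = (clicks_per_window.writeStream
        .outputMode("append")
        .format("parquet")
        .option("path", "s3a://my-bucket/aggregates/clicks/")
        .option("checkpointLocation", "s3a://my-bucket/checkpoints/aggregates/")
        .start())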

    3. How should the query system work here?

    As mentioned in answer 2, build a Hive catalog on the stored data to get a live view of the events. For reporting purposes, use Spark/Hive/Presto to query the data. For queries on real-time data, use Cassandra or HBase as low-latency systems.
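    For example, the second report from the question can be expressed as SQL over the Hive catalog. This sketch assumes the events table is registered as analytics.events (the database, table and date are illustrative); partition pruning on the date column keeps the scan to a single day:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("events-reporting")
        .enableHiveSupport()
        .getOrCreate())

    # Sum of clicks from users with referrer google.com, based in India,
    # on Desktop, for one day of data.
    report = spark.sql("""
        SELECT count(*) AS clicks
        FROM analytics.events
        WHERE date = '2018-11-25'
          AND action = 'click'
          AND referrer = 'http://www.google.com'
          AND country = 'India'
          AND device_type = 'Desktop'
    """)
    report.show()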

    4. If we are looking at maintaining data of about 4 TB, which we estimate to accumulate over 3 months, what should the strategy be to retain this data? When and how should we delete it?

    If you have partitioned the data properly, you can archive data to cold backup based on a periodic archive rule. For example, the dimensions and metrics generated from events can be retained, while the raw events are archived after 1 month.
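    A minimal sketch of such a retention job, assuming the Hive table analytics.events partitioned by date and hour as above (the 30-day cutoff is illustrative, and the copy to cold storage is assumed to happen in a separate archive step before the drop):

    from datetime import date, timedelta
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("events-retention")
        .enableHiveSupport()
        .getOrCreate())

    cutoff = date.today() - timedelta(days=30)

    for row in spark.sql("SHOW PARTITIONS analytics.events").collect():
        # A partition spec looks like "date=2018-10-01/hour=16".
        parts = dict(kv.split("=") for kv in row.partition.split("/"))
        if date.fromisoformat(parts["date"]) < cutoff:
            # Dropping the partition removes it from the live table;
            # the underlying files should already be archived.
            spark.sql(
                "ALTER TABLE analytics.events DROP IF EXISTS "
                f"PARTITION (date='{parts['date']}', hour={parts['hour']})"
            )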