Tags: elasticsearch, amazon-s3, amazon-dynamodb, timeline

Load data from S3 to sort and allow timeline analysis


I'm currently trying to work out the best architecture for my use case:

I have two completely separate S3 buckets that contain data stored in JSON format. The data is partitioned by year/month/day prefixes, and within a single day's prefix I can find e.g. hundreds of files for that date

(example: s3://mybucket/2018/12/31/file1, s3://mybucket/2018/12/31/file2, s3://mybucket/2018/12/31/file..n)

Unfortunately, within the prefix for a single day, the JSON records in those tens or hundreds of files are not ordered by exact timestamp - so, following this example:

s3://mybucket/2018/12/31/

I can find:

file1 - which contains a JSON record about object "A" with timestamp "2018-12-31 18:00"

file100 - which contains a JSON record about object "A" with timestamp "2018-12-31 04:00"

What's even worse, I have the same scenario with my second bucket.

What do I want to do with this data?

Gather the events from both buckets, grouped by object "ID" and sorted by timestamp, so that in a final step I can visualise them on a timeline (which tools and how is out of scope).

My doubts are more about how to do it:

  • In a cost-efficient way
  • Cloud native (in AWS)
  • With the smallest possible maintenance

What I was thinking of:

  1. Not sure about this, but: load every new file that arrives in S3 into DynamoDB (using a Lambda trigger). AFAIK, creating the table the proper way - with my ID as the hash (partition) key and the timestamp as the range (sort) key - should work for me, correct? Every new item inserted would then be partitioned by its ID and already stored in the correct order - but I'm not an expert (see the Lambda sketch just below this list).

  2. Use Logstash to load data from S3 into Elasticsearch - again, AFAIK everything in ES can be indexed, and therefore also sorted. Timelion will probably let me do the fancy analysis I need to create. But again, I'm not sure whether ES will perform the way I want, what the price will be, whether the volume is too big, etc.

  3. ??? No other ideas
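
For option 1, a minimal sketch of what the Lambda side could look like - assuming a function subscribed to S3 ObjectCreated notifications and a DynamoDB table (named "events" here, a made-up name) with ID as the partition key and Timestamp as the sort key; the file layout (one JSON array per file) is also an assumption:

```python
# Hypothetical sketch for option 1: S3 ObjectCreated event -> parse JSON -> DynamoDB.
# The table name "events" and the field names (ID, Event_type, Timestamp, Price) are assumptions.
import json
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("events")  # partition key: ID, sort key: Timestamp

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Assuming each file holds a JSON array of events; adjust if it's one JSON object per line.
        for item in json.loads(body):
            table.put_item(Item={
                "ID": item["ID"],
                "Timestamp": item["Timestamp"],  # sort key keeps events ordered per ID
                "Event_type": item["Event_type"],
                "Price": str(item["Price"]),     # stored as string; DynamoDB does not accept floats
            })
```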

To help you understand my need and show a bit of the data structure, I prepared this: :)

[image: example of workflow]

Volume of data?

Around ±200,000 events - each event is a JSON object with 4 fields (ID, Event_type, Timestamp, Price)

To summarize:

I need to put the data somewhere efficiently, minimizing cost, and keep it sorted so that in the next step a front end can present how events change over time, filtered by a particular "ID".

Thanks, and I appreciate any good advice, best practices, or solutions I can rely on! :)

@John Rotenstein - you are right, I completely forgot to add those details. Basically, I don't need any SQL functionality, as the data will not be updated. The only scenario is that a new event for a particular ID arrives, so the data is purely new and incremental. Based on that, the only operation I will perform on this dataset is "Select". That's why I would prefer speed and instant answers. People will mostly look at this per "ID" - i.e. by filtering. Data arrives on S3 every 15 minutes (new files).
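
For completeness, if the DynamoDB route were chosen, that "Select per ID" access pattern maps directly onto a Query against the sort key. This is only a sketch under the same assumed table and field names as the Lambda example above:

```python
# Hypothetical read side of the DynamoDB option: fetch all events for one ID,
# returned in Timestamp (sort key) order, oldest first.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("events")  # assumed table name

def events_for(object_id):
    items = []
    kwargs = {
        "KeyConditionExpression": Key("ID").eq(object_id),
        "ScanIndexForward": True,  # ascending by sort key (Timestamp)
    }
    while True:
        resp = table.query(**kwargs)
        items.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:
            return items
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]  # paginate

print(events_for("A"))  # all events for object "A", sorted by time
```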

@Athar Khan - thanks for the good suggestion!


Solution

  • As far as I understand this, I would choose the second option: Elasticsearch, with Logstash loading the data from S3, and Kibana as the tool to investigate, search, sort and visualise.

    Having a Lambda push data from S3 to DynamoDB would probably work, but it might be less efficient and cost more, as you would be running a compute process for each event while pushing to DynamoDB in small/single-item batches. Logstash, on the other hand, would read the files one by one and process them all. It also depends on how often you plan to load fresh data into S3, but both solutions should fit.

    The fact that the timestamps are not ordered within the files is not an issue in Elasticsearch: you can index the documents in any order, and you will still be able to search and visualise them in Kibana sorted by time.
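
To illustrate that last point, here is a rough sketch (using the Python elasticsearch client; the index name "events", the endpoint and the field names are assumptions, not something defined in the question) of how the front end could ask for one object's events already sorted by time, regardless of the order in which Logstash indexed them:

```python
# Hypothetical query against Elasticsearch: filter by ID, sort by Timestamp at query time.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

resp = es.search(
    index="events",                          # assumed index name written by Logstash
    query={"term": {"ID": "A"}},             # all events for object "A"; use "ID.keyword" if ID is mapped as text
    sort=[{"Timestamp": {"order": "asc"}}],  # sorted at query time, insertion order is irrelevant
    size=1000,
)
for hit in resp["hits"]["hits"]:
    doc = hit["_source"]
    print(doc["Timestamp"], doc["Event_type"], doc["Price"])
```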