Tags: postgresql, elasticsearch, trading

Storing and managing Forex trading tick data


I'm building a data visualization system for Forex trading, and I'm exploring ways of storing the historical tick data that I have.

The data consist of chronological ticks of Ask and Bid prices per currency pair (e.g. USD/CAD). At the end of the day I need my data to be indexed in Elasticsearch, and what I'm searching for is the best way to get them there.
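To make the shape of the data concrete, each tick boils down to a pair, a timestamp, and the Ask/Bid prices. Below is a minimal sketch of the kind of index I have in mind, using the elasticsearch-py 8.x client; the index name and field names are purely illustrative, nothing is fixed yet:

```python
# Rough sketch of the tick documents to be indexed (index/field names are illustrative).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assuming a local dev cluster

# One document per tick: currency pair, timestamp, Ask and Bid prices.
es.indices.create(
    index="forex-ticks",
    mappings={
        "properties": {
            "pair": {"type": "keyword"},   # e.g. "USD/CAD"
            "tick_time": {"type": "date"},
            "ask": {"type": "double"},
            "bid": {"type": "double"},
        }
    },
)
```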

I found a few approaches online; they start out simple but then get progressively more complicated, and I'm wondering whether the extra complexity is worth it. Some of my options are:

  1. Storing tick data in PostgreSQL and syncing it to Elasticsearch via a plugin such as ZomboDB (here)
  2. Storing tick data in PostgreSQL, pushing it to Logstash, and then to Elasticsearch
  3. Finally, storing tick data in PostgreSQL, pushing it to Redis, then to Logstash, and then to Elasticsearch

My intuition says that solution No 2 would be the ideal one, but what is considered best practice?


Solution

  • It's a good idea to store your data in a long-term storage DB, such as PostgreSQL. That way you can decide at any time to change your mappings, add fields, remove fields, change their types, or what have you, and then easily rebuild your ES index/indices from your primary source of truth (i.e. PostgreSQL) without too much trouble, so you always have clean data in ES.
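
    As a rough illustration of such a source of truth (the table, column names, and connection details below are just assumptions), the whole tick history can live in a single table:

    ```python
    # Minimal sketch of a source-of-truth tick table in PostgreSQL (illustrative names).
    import psycopg2

    conn = psycopg2.connect("dbname=forex user=forex password=secret host=localhost")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS ticks (
                id        BIGSERIAL   PRIMARY KEY,
                pair      TEXT        NOT NULL,   -- e.g. 'USD/CAD'
                tick_time TIMESTAMPTZ NOT NULL,
                ask       NUMERIC     NOT NULL,
                bid       NUMERIC     NOT NULL
            );
            CREATE INDEX IF NOT EXISTS ticks_time_idx ON ticks (tick_time);
        """)
    conn.close()
    ```

    Rebuilding your ES index then just means re-reading this table and re-indexing it, which is cheap compared to not having the raw data at all.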

    I don't know ZomboDB (solution 1), so I can't really speak for it. All I can say is that I'm generally not too fond of tying two different technologies together: it makes it hard to upgrade either of them when you need or want to apply patches or benefit from new features.

    Unless you have big and costly transformations to do on your source data, I feel that solution 3 doesn't bring much: the additional step of staging data in an intermediary Redis doesn't add much value in my opinion (your mileage may vary here). It's a good idea to use a temporary store, such as Redis or Kafka, when you may lose data along the pipeline, but in this case, since you have your data in PostgreSQL, you don't really run the risk of losing anything. Worst case, you can relaunch your pipeline and rebuild a few days of data.

    That leaves solution 2, which would be fine given the information at hand. Using the Logstash JDBC input, you can easily retrieve the latest changes and forward them to ES every x minutes.
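
    Leaving the Logstash configuration itself aside, the incremental pull it performs boils down to something like the following Python sketch (the connection string, index, and table names are assumptions; Logstash's JDBC input tracks the last-seen value for you via its tracking column and :sql_last_value):

    ```python
    # Rough Python equivalent of the periodic pull the Logstash JDBC input would do
    # (illustrative names; run it every x minutes, e.g. via cron).
    import psycopg2
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")
    conn = psycopg2.connect("dbname=forex user=forex password=secret host=localhost")

    def sync_new_ticks(last_seen):
        """Fetch ticks newer than last_seen and bulk-index them into ES."""
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, pair, tick_time, ask, bid"
                " FROM ticks WHERE tick_time > %s ORDER BY tick_time",
                (last_seen,),
            )
            actions = (
                {
                    "_index": "forex-ticks",
                    "_id": row[0],  # reuse the PostgreSQL id so re-runs update instead of duplicating
                    "_source": {
                        "pair": row[1],
                        "tick_time": row[2].isoformat(),
                        "ask": float(row[3]),
                        "bid": float(row[4]),
                    },
                }
                for row in cur
            )
            helpers.bulk(es, actions)
        # Persist the newest tick_time you processed and pass it in on the next run,
        # much like the JDBC input's tracking_column / :sql_last_value mechanism.
    ```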