How to store the chat history in Django + Pusher? Is Tornado or Celery needed?

We are a small startup building a chat service and we are having big trouble to figure out how to storage the chat history. We are also unsure about how necesary Tornado is in our scenario.

We are have a Django app running on Heroku, and we are not yet sure if we should implement a separate Tornado app which forwards messages to Pusher, so right now is the Django app who receives messages from clients and forwards them to Pusher channels.

Our initial architecture looks now like in the picture below:

chattyhive initial architecture

We are using a PostgreSQL to storage user profiles, info about the chat rooms and so on. And we don't know what is now the best approach to store the chat messages. Can we use PostgreSQL also for this? is it possible to use Redis with persistence to store the whole chat history (even if the app grows a lot in number of users) or would be best to use Redis for the most recent messages and PostgreSQL for the whole rest of the history. We are also curious about other NoSQL solutions like couchDB, HBase, etc. which seems to be in the architecture of big apps like hipchat or line, but it seems not many projects are using them in the very beginning and the support for them in Heroku is not the same. Should we look at them if we are planning to have a big growth?

Thats the first part of our headaches, the other part is how important is to use Tornado for the messaging part of our app if we are already using Pusher. And if we do so, what could be a possible approach to combine both apps. If the Tornado app receives and storage the messages: how can we access to this messages from the Django model layer to perform searchs and so on?, can the Tornado app storage the messages in a database that is shared with the Django app?

And finally: how can Celery help? can we stick on Django and use Celery to queue the messages so they are asynchronously delivered to Pusher?

We would be really thankful if you could shed some light on this. We have reasearched quite a lot this week and still nothing really clear! It would be nice to know if we could start from the most simple and make some kind of progression: storing chat history in our PostgreSQL with Django, and then moving to Redis for a cache of the most recent messages, and then maybe integrating Celery and so on. Or if we should go ahead and Implement a Tornado App to handle everything related with messaging from now on!

Solution

Should we use Tornado ?

The underlying question is : would you benefit from Tornado's asynchronous capabilities on requests coming from your clients ? Do you have to wait for async results (like the result of a request to pusher) to produce the http response ?

If you do (that could be feedback you want to send to a client sending a message), then tornado web server will allow you to handle another request while waiting for the needed resource to be fetched asynchronously.

I repeated async a lot of times because it's really the core benefit you can get from tornado, if you don't have or need non blocking async resources to produce a response, tornado will just behave like any other blocking webserver.

Can we use Tornado with django ?

Sure, you can use django's orm, forms, templates ans other parts of the stack from a tornado app. It will go off tracks from django's documentation, but you can find some articles on a tornado+django stack on the web

Tornado and Pusher

What the tornado channel in pusher uses is the tornado async http client.

That's one example of non-blocking asynchronous resource you can use.

Will celery help ?

Celery will enable you to enqueue/schedule jobs asynchronously, like push messages to pusher or to your persistence backend, make a long-running search, schedule pushes of some stats.

It can be used as a non blocking async resource with tornado too. See tornado-celery

What you could try for example is multiplexing jobs to minimize roundtrips over the network where you can, but that's premature optimization :D

Persistence - Postgres, Redis.

What you'll probably have to worry about is partitioning and replication for scalability, to distribute the load on several instances of your persistence backend.

Redis is often said to be very scalable in that way, I'd try it, but it's personal opinion and eagerness rather than experience and benchmarking.

Hope that helps :)