I have a Django backend (PostgreSQL database).
Suppose a given table, say A, has a CharField called 'message'. What I want to do is find all items in A whose 'message' is similar to the 'message' field of a given instance. The similarity will be based on some algorithm. TL;DR: I want to find items based on item-item similarity.
The question has 3 parts:
How can I do it? Can I do it in real time (slow), or will I have to precompute the similarity between all items in table A? (This might blow up my DB.)
How can I find the similarity between 'message' fields? Note that each item is more like a 400-character post than a group of keywords. I've come across many algorithms that calculate string distance, but I don't think that will cut it. I think something like TF-IDF followed by cosine similarity is more appropriate.
How do I achieve the above in a production setting? That is, what data structure should I use to balance request response time against storage?
Django Haystack's more_like_this might do the trick:
http://django-haystack.readthedocs.org/en/v2.4.1/searchqueryset_api.html#more-like-this
SearchQuerySet.more_like_this(self, model_instance)
You can pass in an instance of the model to fetch similar results.
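If a search backend is not an option, the TF-IDF plus cosine similarity idea from the question can be tried directly. Below is a minimal sketch in plain Python, not a production implementation. The function names (tokenize, tfidf_vectors, cosine_similarity, most_similar), the simple regex tokenizer, and the smoothed IDF formula are all assumptions made for illustration; nothing here comes from Haystack or Django.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty messages
        vectors.append({
            # Smoothed IDF (an assumption): common terms keep a small positive weight.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target_index, documents, top_n=5):
    """Rank all other documents by similarity to documents[target_index]."""
    vectors = tfidf_vectors(documents)
    target = vectors[target_index]
    scored = [
        (i, cosine_similarity(target, vec))
        for i, vec in enumerate(vectors)
        if i != target_index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Hypothetical sample messages standing in for rows of table A.
    messages = [
        "the cat sat on the mat",
        "a cat sat on a mat today",
        "stock prices fell sharply on monday",
    ]
    for index, score in most_similar(0, messages):
        print(index, round(score, 3))
```

In a Django project the list of messages could presumably be loaded with something like A.objects.values_list('message', flat=True). This sketch recomputes every vector on each call, which is only reasonable for small tables; for larger ones, the vectors (or the top matches per item) would likely need to be precomputed and stored, which is the time-versus-storage trade-off the question asks about.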