solr document-classification text-classification

Document Clustering and Classification in Solr?

I'm building an index of documents in Solr. The documents are non-scientific.

Each document has a category linked to it, so the existing documents can be used for training. I would like to assign a category to each new document when it is added. Documents are added all the time, without rebuilding the whole index.

Documents can also be about the same thing but come from different sources, so I'd like to cluster them. When a document is added, I want to check whether I already have that topic within the last N days and, if so, save its cluster ID.

The index holds about 500k documents and is growing, so speed is important.

So for each new document I want to calculate a Category ID (based on training with pre-defined documents) and a Cluster ID (matched only against the last N days, not the whole index).

Is this possible with Solr? Or is it better to build a separate solution (and if so, which one)?

Solution

Solr 6.1 and Lucene 6.1 have this capability now. They offer kNN and naive Bayes classifiers off the shelf. This is a great post about how to use them in Solr: solr based text classification
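
To give a feel for what a kNN text classifier does, here is a minimal standalone sketch in plain Python. It is not Solr's or Lucene's implementation and does not call their APIs; the function names (`tokenize`, `cosine`, `knn_classify`) and the use of raw term counts with cosine similarity are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, labeled_docs, k=3):
    """Return the majority category among the k most similar labeled docs.

    labeled_docs is a list of (text, category) pairs (hypothetical format).
    Returns None when no labeled document shares a term with the input.
    """
    query = Counter(tokenize(text))
    scored = sorted(
        ((cosine(query, Counter(tokenize(doc))), label)
         for doc, label in labeled_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Only neighbours with a non-zero similarity get a vote.
    votes = Counter(label for score, label in scored[:k] if score > 0)
    return votes.most_common(1)[0][0] if votes else None


training = [
    ("the team won the football match", "sports"),
    ("football players scored a late goal", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("stock markets fell as rates rose", "finance"),
]

print(knn_classify("a late goal decided the football match", training))
```

In this toy version every labeled document is scanned on each call, which would not scale to 500k documents; the point of doing it inside the search index is that the nearest neighbours can be found with a query instead of a full scan.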