I'm trying to develop an program in Python that can process raw chat data and cluster sentences with similar intents so they can be used as training examples to build a new chatbot. The goal is to make it as quick and automatic (i.e. no parameters to enter manually) as possible.
1- For feature extraction, I tokenize each sentence, stem its words and vectorize it using Sklearn's TfidfVectorizer.
2- Then I perform clustering on those sentence vectors with Sklearn's DBSCAN. I chose this clustering algorithm because it doesn't require the user to specify the desired number of clusters (like the k parameter in k-means). It throws away a lot of sentences (considering them as outliers), but at least its clusters are homogeneous.
The overall algorithm works on relatively small datasets (10000 sentences) and generates meaningful clusters, but there are a few issues:
On large datasets (e.g. 800000 sentences), DBSCAN fails because it requires too much memory, even with parallel processing on a powerful machine in the cloud. I need a less computationally-expensive method, but I can't find another algorithm that doesn't make weird and heterogeneous sentence clusters. What other options are there? What algorithm can handle large amounts of high-dimensional data?
The clusters that are generated by DBSCAN are sentences that have similar wording (due to my feature extraction method), but the targeted words don't always represent intents. How can I improve my feature extraction so it better captures the intent of a sentence? I tried Doc2vec but it didn't seem to work well with small datasets made of documents that are the size of a sentence...
A standard implementation of DBSCAN is supposed to need only O(n) memory. You cannot get lower than this memory requirement. But I read somewhere that sklearn's DBSCAN actually uses O(n²) memory, so it is not the optimal implementation. You may need to implement this yourself then, to use less memory.
Don't expect these methods to be able to cluster "by intent". There is no way an unsupervised algorithm can infer what is intended. Most likely, the clusters will just be based on a few key words. But this could be whether people say "hi" or "hello". From an unsupervised point of view, this distinction gives two nice clusters (and some noise, maybe also another cluster "hola").
I suggest to train a supervised feature extraction based on a subset where you label the "intent".