Tags: python, algorithm, word-embedding

Community detection for larger than memory embeddings dataset


I have a dataset of textual embeddings (768 dimensions), currently ~1 million records. I want to detect related embeddings with a community detection algorithm. For small data sets, I have been able to use this one:

https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py

It works great, but it doesn't scale once the data set grows larger than memory.

The key here is that I can specify a threshold for community matches. I have found clustering algorithms that scale to larger-than-memory data, but they all require a fixed number of clusters up front. I need the system to detect the number of clusters for me.

I'm certain there is a class of algorithms, and hopefully a Python library, that can handle this situation, but I have been unable to locate one. Does anyone know of an algorithm or a solution I could use?
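
For reference, what I am doing today is essentially threshold-based community detection; a minimal sketch using the `util.community_detection` helper from sentence-transformers (the threshold, minimum community size, and the `embeddings.npy` path below are placeholders):

```python
import numpy as np
import torch
from sentence_transformers import util

# Hypothetical file of L2-normalized float32 embeddings, shape (n, 768).
embeddings = torch.from_numpy(np.load("embeddings.npy"))

# Groups rows whose cosine similarity exceeds the threshold; the number of
# communities is discovered rather than fixed in advance, but the whole
# embedding matrix has to sit in memory.
communities = util.community_detection(embeddings, threshold=0.75, min_community_size=10)
print(f"found {len(communities)} communities")
```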


Solution

  • That seems small enough that you could just rent a bigger computer: ~1 million float32 vectors of 768 dimensions is only about 3 GB.

    Nevertheless, to answer the question: the typical play is to coarsely cluster the data into a few chunks (overlapping or not) that fit in memory, and then apply a higher-quality in-memory clustering algorithm to each chunk. One common strategy for cosine similarity is to bucket by SimHashes (see the sketch after this list), but

    1. there's a whole literature out there;
    2. if you already have a scalable clustering algorithm you like, you can use that.
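
    A minimal sketch of that chunk-and-recluster idea, assuming the embeddings live in an `embeddings.npy` file of L2-normalized float32 rows (a hypothetical path). The bit width, batch size, and threshold are illustrative values, and the per-bucket step reuses the threshold-based community detection utility from sentence-transformers, i.e. the same idea as the script linked in the question:

    ```python
    from collections import defaultdict

    import numpy as np
    import torch
    from sentence_transformers import util

    N_BITS = 8        # 2**8 = 256 buckets; raise this until each bucket fits in memory
    BATCH = 100_000   # rows read per pass over the memory-mapped file

    emb = np.load("embeddings.npy", mmap_mode="r")   # larger-than-memory, read lazily
    rng = np.random.default_rng(0)
    planes = rng.standard_normal((emb.shape[1], N_BITS)).astype(np.float32)
    powers = 1 << np.arange(N_BITS)

    # Pass 1: SimHash every row -- the sign pattern of its projections onto
    # N_BITS random hyperplanes -- and group row indices by hash code.
    buckets = defaultdict(list)
    for start in range(0, emb.shape[0], BATCH):
        chunk = np.asarray(emb[start:start + BATCH])
        codes = (chunk @ planes > 0).astype(np.int64) @ powers
        for offset, code in enumerate(codes):
            buckets[int(code)].append(start + offset)

    # Pass 2: load one bucket at a time, run the in-memory community detection
    # inside it, and map local indices back to global row ids.
    all_communities = []
    for row_ids in buckets.values():
        row_ids = np.asarray(row_ids)
        sub = torch.from_numpy(np.ascontiguousarray(emb[row_ids]))
        for community in util.community_detection(sub, threshold=0.75, min_community_size=10):
            all_communities.append(row_ids[community].tolist())

    print(f"{len(all_communities)} communities across {len(buckets)} buckets")
    ```

    The obvious caveat is that two very similar vectors can land on opposite sides of a random hyperplane and end up in different buckets; using several independent hash tables (i.e. overlapping chunks) and merging communities that share members is the usual way to recover that lost recall.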