Tags: python, cluster-analysis, topic-modeling

Topic modelling many documents with low memory overhead


I've been working on a topic modelling project using BERTopic 0.16.3, and the preliminary results were promising. However, as the project progressed and the requirements became apparent, I ran into a specific issue with scalability.

Specifically:

  • For development/testing, it needs to train reasonably quickly on a moderate number of documents (tens of thousands to low hundred thousands)
    • Our dev machines are Macs, so this probably has to be done on CPU
  • For production, it needs to train on a large number of documents (several million) without blowing up memory usage
    • For a baseline, with the default settings on my machine, BERTopic's peak memory usage is roughly 35 kB per document, which works out to hundreds of GB (or even TB) at the data volumes expected in production
    • Ideally, this would have peak memory usage sublinear in the number of documents.

That last requirement necessitates batching the documents, since loading them all into memory at once requires linear memory. So, I've been looking into clustering algorithms that work with online topic modelling. BERTopic's documentation suggests scikit-learn's MiniBatchKMeans, but the results I'm getting from that aren't very good.
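
For reference, the online setup I've been testing follows BERTopic's online topic modelling guide, roughly like the sketch below. The parameter values and the document_batches iterator are placeholders from my experiments, not recommendations:

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.decomposition import IncrementalPCA
    from bertopic import BERTopic
    from bertopic.vectorizers import OnlineCountVectorizer
    
    # Incremental components so the model can be updated batch by batch
    umap_model = IncrementalPCA(n_components=5)            # stands in for UMAP
    cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
    vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=0.01)
    
    topic_model = BERTopic(
        umap_model=umap_model,
        hdbscan_model=cluster_model,
        vectorizer_model=vectorizer_model,
    )
    
    # Feed the corpus in chunks instead of loading it all at once;
    # document_batches is any iterator over lists of strings (placeholder name)
    for batch in document_batches:
        topic_model.partial_fit(batch)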

Some models I've looked at include:

  • Birch via scikit-learn: even when batched, it uses more memory than BERTopic's default HDBSCAN and runs much slower.
  • IncrementalDBSCAN via incdbscan: seemed promising at first, but runtime and, eventually, memory ballooned. For ~120k documents in batches of 5000, it stayed under 4 GB of RAM for the first 3½ hours, but it hadn't finished after ten hours and used nearly 40 GB of RAM at some point in the middle.
  • AgglomerativeClustering via scikit-learn: gave very good results from initial testing (perhaps even better than HDBSCAN), but it doesn't implement the partial_fit method. I found this answer on a different question which suggests it's possible to train two of them using single linkage independently and then merge them, but it gives no indication as to how.

The latter two also don't provide the predict method, limiting their utility.
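
For the missing predict, the workaround I've been considering is a thin wrapper that remembers cluster centroids at fit time and assigns new points to the nearest centroid. The class below is my own hypothetical sketch (not part of any library), and I'm not sure how sound it is:

    import numpy as np
    
    class NearestCentroidWrapper:
        """Add predict() to a clusterer that only exposes labels_ after fit()."""
    
        def __init__(self, clusterer):
            self.clusterer = clusterer
    
        def fit(self, X):
            X = np.asarray(X)
            self.clusterer.fit(X)
            labels = np.asarray(self.clusterer.labels_)
            self.labels_ = labels
            # One centroid per real cluster; noise points labelled -1 are ignored
            self.cluster_ids_ = np.unique(labels[labels >= 0])
            self.centroids_ = np.vstack(
                [X[labels == c].mean(axis=0) for c in self.cluster_ids_]
            )
            return self
    
        def predict(self, X):
            X = np.asarray(X)
            # Nearest-centroid assignment for new/unseen points
            dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
            return self.cluster_ids_[dists.argmin(axis=1)]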

I am fairly new to the subject, so perhaps I'm approaching this completely wrong and the immediate problem has no solution. To be clear, the underlying question I'm trying to answer is: how do I perform topic modelling (and get good results) on a large number of documents without using too much memory?


Solution

  • In general, advanced techniques like UMAP and HDBSCAN help produce high-quality results on larger datasets, but they will take more memory. Unless it's absolutely required, you may want to consider relaxing the strict memory constraint for the sake of performance, real-world human time, and actual cost (hourly instance or otherwise).

    At this scale, for a workflow you expect to go to production, it may be easier to switch hardware than to work around the limitation in software. The GPU-accelerated UMAP and HDBSCAN in cuML can handle this much data very quickly -- quick enough that it's probably worth considering renting a GPU-enabled system if you don't have one locally.

    For the following example, I took a sample of one million Amazon reviews, encoded them into embeddings (384 dimensions), and used the GPU UMAP and HDBSCAN in the current cuML release (v24.08). I ran this on a system with an H100 GPU.

    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer
    import pandas as pd
    from cuml.manifold.umap import UMAP
    from cuml.cluster import HDBSCAN
    
    df = pd.read_json("Electronics.json.gz", lines=True, nrows=1000000)
    reviews = df.reviewText.tolist()
    
    # Create embeddings
    sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = sentence_model.encode(reviews, batch_size=1024, show_progress_bar=True)
    
    # Reduce the 384-dimensional embeddings to 5 dimensions with GPU UMAP
    reducer = UMAP(n_components=5)
    %time reduced_embeddings = reducer.fit_transform(embeddings)
    CPU times: user 1min 33s, sys: 7.2 s, total: 1min 40s
    Wall time: 7.31 s
    
    # Cluster the reduced embeddings with GPU HDBSCAN
    clusterer = HDBSCAN()
    %time clusterer.fit(reduced_embeddings)
    CPU times: user 21.5 s, sys: 125 ms, total: 21.6 s
    Wall time: 21.6 s
    

    There's an example of how to run these steps on GPUs in the BERTopic FAQs.
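
    As a minimal sketch of what that wiring can look like (the parameter values here are illustrative rather than tuned), the cuML models can be passed straight into BERTopic:

    from bertopic import BERTopic
    from cuml.cluster import HDBSCAN
    from cuml.manifold import UMAP
    
    # GPU-accelerated drop-in replacements for the CPU UMAP/HDBSCAN models
    umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
    hdbscan_model = HDBSCAN(min_samples=10, prediction_data=True)
    
    topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
    
    # reviews and embeddings are the list/array created in the snippet above
    topics, probs = topic_model.fit_transform(reviews, embeddings)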

    I work on these projects at NVIDIA and am a community contributor to BERTopic, so if you run into any issues, please let me know and file a GitHub issue.