Tags: python, bert-language-model

How to fix random seed for BERTopic?


I'd like to fix the random seed used by the BERTopic library to get reproducible results. Looking at the BERTopic code, I see that it uses numpy. Will calling np.random.seed(123) be enough, or do I also need to seed other libraries such as random or PyTorch, as in this question?


Solution

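  • You can fix the seed by passing random_state to UMAP, but you also have to pass BERTopic's other default UMAP parameters to the UMAP constructor. Otherwise UMAP falls back to its own defaults, which differ from the ones BERTopic uses, and the model will not behave as it does out of the box.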

    What this looks like in practice is:

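    from bertopic import BERTopic
    from umap import UMAP

    # Same values BERTopic uses for its default UMAP, plus a fixed seed.
    umap_model = UMAP(n_neighbors=15,
                      n_components=5,
                      min_dist=0.0,
                      metric='cosine',
                      low_memory=False,
                      random_state=1337)
    model = BERTopic(language="multilingual", umap_model=umap_model)
    # content: your list of documents (strings)
    topics, probs = model.fit_transform(content)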
    

    By default, umap_model is None in the BERTopic constructor. When no UMAP model is passed, BERTopic builds one internally with its default parameters, here in the code.

    Note that low_memory is a parameter of both constructors. If you don't pass it to the BERTopic constructor, BERTopic sets it to False internally.
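
If you build seeded models in more than one place, it may help to keep the parameter values from the snippet above in a single helper. The following is only a sketch: the function names bertopic_default_umap_kwargs, build_seeded_bertopic and same_topics_on_refit are made up for illustration and are not part of BERTopic or UMAP. It assumes the default values quoted above still match your installed BERTopic version, so check them against the source if in doubt.

```python
def bertopic_default_umap_kwargs(seed, low_memory=False):
    """UMAP settings quoted in the answer above, plus a fixed random_state.

    Hypothetical helper: the values are copied from the snippet in this
    answer, not read from the installed BERTopic version.
    """
    return {
        "n_neighbors": 15,
        "n_components": 5,
        "min_dist": 0.0,
        "metric": "cosine",
        "low_memory": low_memory,
        "random_state": seed,
    }


def build_seeded_bertopic(seed, **bertopic_kwargs):
    """Return a BERTopic model whose UMAP step uses a fixed seed."""
    # Imported lazily so the helper above works without these packages.
    from bertopic import BERTopic
    from umap import UMAP

    # Keep low_memory consistent between the two constructors.
    low_memory = bertopic_kwargs.get("low_memory", False)
    umap_model = UMAP(**bertopic_default_umap_kwargs(seed, low_memory))
    return BERTopic(umap_model=umap_model, **bertopic_kwargs)


def same_topics_on_refit(docs, seed=1337, **bertopic_kwargs):
    """Fit two fresh models with the same seed and compare topic labels."""
    topics_a, _ = build_seeded_bertopic(seed, **bertopic_kwargs).fit_transform(docs)
    topics_b, _ = build_seeded_bertopic(seed, **bertopic_kwargs).fit_transform(docs)
    return list(topics_a) == list(topics_b)
```

Calling same_topics_on_refit(content, language="multilingual") is a quick way to check whether the fixed seed gives you the same topic assignments on your own data and machine. Treat a True result as evidence for that setup only, not as a guarantee across library versions or hardware.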