Tags: python, docker, pip, bert-language-model

Why do ML models installed with pip need to download something else after installation?


Why does this statement download the model? Why isn't it downloaded when I install the package with pip3 install keybert? How can I pre-load it into the Docker image so it isn't downloaded every time?

from keybert import KeyBERT
kw_model = KeyBERT()

Right now my Dockerfile does the following:

RUN pip install --user -r requirements.txt

requirements.txt:

google-cloud-pubsub==2.8.0
google-cloud-logging==2.6.0
requests==2.28.0
keybert==0.5.1

Solution

  • One potential solution is to save the model once and copy it into the image:

    1. Run this code on your local computer to save a copy of the model to a local directory, e.g. a directory named "keybert":
    from keybert import KeyBERT
    kw_model = KeyBERT()
    # kw_model.model wraps the underlying SBERT model, which can be saved to disk
    kw_model.model.embedding_model.save("keybert")
    
    2. Add the local copy of the model to the Docker image using the COPY instruction in the Dockerfile:
    # Copy the saved model into the container image.
    COPY ./keybert/ ./keybert/
    
    3. In your script running in the Docker container, load the model from that directory:
    from keybert import KeyBERT
    new_kw_model = KeyBERT("./keybert")
    

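    The reason for this behavior is that pip installs only the KeyBERT code, not any model weights. KeyBERT relies on a separate SBERT (sentence-transformers) embedding model, which is downloaded the first time it is loaded, and a fresh container starts without that download cached. KeyBERT can also be used with more than one model: https://maartengr.github.io/KeyBERT/guides/embeddings.html

    So you'd add a copy of whichever model best suits your purposes to the Docker image.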
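
As a rough sketch of the loading step (not part of KeyBERT itself), the script in the container could prefer the copied directory and only fall back to a named model if that directory is missing or empty. The helper names `resolve_model_source` and `load_keyword_model` are made up for illustration, and the fallback name `all-MiniLM-L6-v2` is an assumption; substitute whichever model you actually saved.

```python
from pathlib import Path

# Directory copied into the image by the COPY step (same path as in the answer).
MODEL_DIR = Path("./keybert")
# Assumed fallback model name; replace with the model you actually use.
FALLBACK_MODEL = "all-MiniLM-L6-v2"


def resolve_model_source(model_dir=MODEL_DIR, fallback=FALLBACK_MODEL):
    """Return the local directory if it contains files, else the fallback name."""
    model_dir = Path(model_dir)
    if model_dir.is_dir() and any(model_dir.iterdir()):
        return str(model_dir)
    return fallback


def load_keyword_model(model_dir=MODEL_DIR, fallback=FALLBACK_MODEL):
    """Build a KeyBERT instance, using the copy baked into the image when present."""
    # Imported lazily so the path check above works even where keybert is absent.
    from keybert import KeyBERT

    return KeyBERT(model=resolve_model_source(model_dir, fallback))
```

Calling `load_keyword_model()` in place of `KeyBERT()` would then read from `./keybert` inside the container. If the fallback is ever used, the download happens again, so it may be worth logging which source was chosen.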