I am using GitLab to host a Python machine-learning pipeline. The pipeline includes trained weights for a model that I do not want to store in git. The weights live in remote data storage, and the pipeline pulls them automatically when its job runs.
This works, but I run into a problem when trying to run end-to-end automated CI tests with this setup. I do not want to download the model weights from the remote storage every time CI is triggered (that can get expensive). In fact, for security reasons I want to block all internet access within the CI tests entirely (for example by patching socket in my conftest.py).
If I do this, I obviously cannot access the location where my model weights are stored. I know I can mock the model's output for testing, but I actually want to test whether the weights of the model are sensible, so mocking is out of the question.
I posted a similar question before, and one of the suggested solutions was to take advantage of GitLab's caching mechanism to store the model weights.
However, I am not able to figure out how to do that exactly. From what I understand of caching, if I enable it, GitLab will download the necessary files from the internet once and reuse them in later pipelines, which sounds close to what I want: fetch the weights once, then run every later pipeline without any network access.
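A cache configuration along those lines might look like the sketch below. The weights path, cache key, and the fetch_weights.py download script are my assumptions, not anything GitLab prescribes:

```yaml
# .gitlab-ci.yml -- sketch of caching model weights between pipelines.
test:
  stage: test
  cache:
    key: model-weights-v1   # bump this key whenever the weights change
    paths:
      - models/
  script:
    # Download only on a cache miss; later pipelines reuse the cached copy.
    - test -f models/weights.pt || python fetch_weights.py
    - pytest
```

One caveat: GitLab caches are best-effort (a runner may not have the cache available), so the script should still be able to fall back to downloading on a miss, as the guard above does.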
Is there a good solution for this problem?
One possible solution, though it may not be flexible enough, is to keep the model file in a GitLab CI/CD variable and write it out to the correct path in a job step. GitLab CI also supports file-type variables.
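A sketch of the restore step, assuming the weights were base64-encoded into a CI/CD variable (the variable name, destination path, and helper are my own choices; note that CI/CD variables are size-limited, so this only suits small weight files):

```python
# restore_weights.py -- sketch: materialise model weights stored in a
# GitLab CI/CD variable. Assumes the weights were base64-encoded into a
# variable named MODEL_WEIGHTS_B64 before being saved in GitLab.
import base64
import os
from pathlib import Path

def restore_weights(env_var: str = "MODEL_WEIGHTS_B64",
                    dest: str = "models/weights.pt") -> Path:
    """Decode a base64-encoded environment variable and write it to dest."""
    encoded = os.environ[env_var]  # raises KeyError if the variable is unset
    path = Path(dest)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(base64.b64decode(encoded))
    return path

# In the CI job: python -c "import restore_weights; restore_weights.restore_weights()"
```

Base64 encoding is needed because variable values are stored as text, so raw binary weights cannot go in directly.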