python unit-testing machine-learning continuous-integration gitlab

How to use gitlab cache to store model weights for an ML pipeline?

I am using gitlab to host an python-Machine Learning pipeline. The pipeline includes trained weights of some model which I do not want to store in git. The weights are stored in some remote data-storage that the pipeline automatically pulls when running its job.

This works, but I have a problem when trying to run some end-end automatic CI tests on with this setup. I do not want to download the model weights from the remote every time my CI is triggered (since that can get expensive). In fact, I want to completely block out my internet connection within all CI-tests for security reasons (for example by configuring socket in my conftest.py).

If I do this, obviously I am not able to access the location where my model weights are stored. I know I can mock the result of the model for testing, but I actually want to test that the weights of the model is sensible or not. So mocking is out of the question.

I posted a similar question before and one of the solutions that I got was to take advantage of gitlab's caching mechanism to store the model weights.

However, I am not able to figure out how to do that exactly. From what I understand of caching, if I enable it, gitlab will download the necessary files from the internet once and reuse them in later pipelines. However, the solution that I am looking for would look something like this -

Upload a file to gitlab manually.
This file is accessible to all my CI jobs, however, this is not tracked by git.
When the file becomes outdated (because I created a new model), I manually upload the updated file.
With the cache workflow, from what I understand, if I want to update the file, I will have to enable the internet in the testing suite, have the pipeline automatically download the new set of weights, and then disable the internet again once the new cache is set up. This feels hacky and unsafe (unsafe, because I never want to enable internet during testing).

Is there a good solution for this problem?

Solution

One possible solution, but may not flexible enough, is keeping model file in GitLab CI Variables and put into the correct path in the step. GitLab CI supports binary file as a variable as well.