Tags: python, distribution, pypi

How best to distribute Python packages with _large_ data dependencies


I am working on a new Python package that depends on many rather large (>20 MB each) data files. Specifically, the library expects the data files to be in a data/ directory at run time.

Currently, I have them in a "data" directory as part of the distribution package and have my setup.py script configured to install these files on the user's system via python setup.py install. This works for now, but it seems it would prevent me from uploading the distribution to PyPI, given that the tarball would likely exceed a few hundred MB.
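A minimal sketch of this kind of setup.py is shown below; the package name is just a placeholder.

```python
# Sketch of bundling a data/ directory inside the package with setuptools.
# "mypackage" is a placeholder, not the actual project name.
from setuptools import setup, find_packages

setup(
    name="mypackage",
    version="0.1",
    packages=find_packages(),
    # Install everything under mypackage/data/ alongside the code.
    package_data={"mypackage": ["data/*"]},
)
```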

As an alternative, I'd like to "host" the files on a remote site so as to be kind to PyPI, and have the files automatically retrieved and installed. Is this possible using the existing Python distribution techniques? If so, could you please describe how to do this or provide an example? If it is not possible, what are the best practices for pulling this off?

Any insight you could offer would be most welcome.


Solution

  • NLTK faces a similar situation with the distribution of its corpus data. On my Linux distribution, the data is shipped as a separate package, so I did some investigation by installing it with setuptools on Windows.

    If you try to use a corpus and you don't have it, NLTK asks you to run the downloader function (nltk.download()). Internally, it uses a LazyCorpusLoader as a stand-in for the corpus objects that need the data and only loads the data once it is actually needed.

    Much like sys.path, it searches a number of paths in order, so the user can put the data wherever they want. You can also modify nltk.data.path to add your own location for the data; both ideas are sketched below.
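    For reference, this is roughly what that looks like from user code; the resource name ("punkt") and the extra directory are only examples:

    ```python
    import nltk
    import nltk.data

    # Fetch a named resource into NLTK's default data directory on first use.
    nltk.download("punkt")

    # Add a custom directory to the list of locations NLTK searches for data.
    nltk.data.path.append("/opt/my-nltk-data")
    ```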
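    Applying the same pattern to your own package, a rough sketch might look like the following. The URL, directory names, and environment variable here are all hypothetical; the point is the search-then-download-on-first-use flow:

    ```python
    import os
    import urllib.request
    from pathlib import Path

    # Hypothetical remote host for the large data files.
    DATA_URL = "https://example.com/mypackage-data/{name}"

    # Candidate locations searched in order, similar in spirit to nltk.data.path.
    SEARCH_PATHS = [
        Path.home() / ".mypackage" / "data",      # per-user cache
        Path(__file__).parent / "data",           # data bundled next to the code
    ]
    if "MYPACKAGE_DATA" in os.environ:            # user override, checked first
        SEARCH_PATHS.insert(0, Path(os.environ["MYPACKAGE_DATA"]))

    def get_data_file(name):
        """Return a local path to `name`, downloading it on first use."""
        for base in SEARCH_PATHS:
            candidate = base / name
            if candidate.exists():
                return candidate

        # Not found locally: fetch it into the per-user cache directory.
        target = Path.home() / ".mypackage" / "data" / name
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(DATA_URL.format(name=name), target)
        return target
    ```

    This keeps the PyPI upload small while still letting users preinstall the data themselves (by dropping it into any of the searched directories) instead of downloading it on first use.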