conda build - recommended way to add heavy test data

I am working on a conda package for software whose test suite needs rather heavy test data (~50 MB). The conda documentation explains how to use test data that is included in the recipe. When the test data is heavy, I would guess it is better to download it on the fly rather than include it in the recipe, but what is the best way to declare that in meta.yaml? Should the download and extraction of the data archive be done in build.sh or somewhere else?

Solution

I recommend listing the test data as an additional source to download.

Most conda recipes only download from a single source tarball (or git repository, etc.), but recipes are permitted to list multiple sources if necessary, all of which are downloaded. Here's a quick example:

{% set name = "foo" %}
{% set version = "0.1" %}

package:
  name: {{ name|lower }}
  version: {{ version }}

source:
  # Main source code
  - url: http://example.com/yada/yada/foo-{{ version }}.tar.gz
    sha256: 90e64c6eca4be47bbf1d61f53dc003c6621213738d4ea7a35e5cf1ac2de9bab1

  # Also download test data into a folder named 'test-data'
  - url: http://example.com/yada/yada/my-test-data.tar.gz
    sha256: 3b9c5e0f09ca14a54454319b64af98a02d0ae1b3eb1122c95e2130736f440cd1
    folder: test-data

build:
  number: 0

requirements:
  # etc, etc, ...

test:
  source_files:
    - test-data
  commands:
    - run_my_tests --data-dir=test-data

Notes:

Set the folder key to specify where the additional source should be unpacked within the work directory. Without it, the archive is unpacked at the root of the work directory, just like the first source.
The work directory is deleted before the test phase begins, so you'll need to list your test data directory in the test:source_files: section to ensure that it is copied to the folder in which the tests are executed.
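
Each source entry in the recipe carries a sha256 value, so you need the checksum of the test data archive before you can list it. Here is a minimal sketch of one way to compute it with Python's standard hashlib module. The helper name sha256_of_file is my own, not something conda-build provides, and it assumes you have already downloaded the archive locally.

```python
import hashlib
import sys


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that a large archive (for example
    ~50 MB of test data) is never loaded into memory all at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Print one "<digest>  <filename>" line per file given on the command line.
    for name in sys.argv[1:]:
        print(f"{sha256_of_file(name)}  {name}")
```

Saved under a name of your choice (say, sha256_of_file.py), it could be run against the downloaded my-test-data.tar.gz, and the printed digest pasted into the sha256 field of the second source entry.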