Tags: python, anaconda, conda, miniconda

Long creation time with conda env create -f environment.yml


I have the following environment.yml file. Creating the environment from it takes about 1.5 hours. How can I improve (or debug) the creation time?

name: test_syn_spark_3_3_1
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.10
  - pandas=1.5
  - pip=23.0
  - pyarrow=11.0.0
  - pyspark=3.3.1
  - setuptools=65.0
  - pip:
      - azure-common==1.1.28
      - azure-core==1.26.1
      - azure-datalake-store==0.0.51
      - azure-identity==1.7.0
      - azure-mgmt-core==1.3.2
      - azure-mgmt-resource==21.2.1
      - azure-mgmt-storage==20.1.0
      - azure-storage-blob==12.16.0
      - azure-mgmt-authorization==2.0.0
      - azure-mgmt-keyvault==10.1.0
      - azure-storage-file-datalake==12.11.0
      - check-wheel-contents==0.4.0
      - pyarrowfs-adlgen2==0.2.4
      - wheel-filename==1.4.1

Solution

  • Switch the channel order and use Mamba. Specifically, pyspark=3.3.1 is only available from Conda Forge, so the conda-forge channel should go first to avoid any channel_priority: strict masking issues (a quick way to check that setting is shown after the create command below). Mamba is faster, gives clearer error reporting, and its maintainers are very responsive.

    test_syn_spark_3_3_1.yaml

    name: test_syn_spark_3_3_1
    channels:
      - conda-forge
      - defaults
    # rest the same...
    

    Create with Mamba (or micromamba):

    ## install mamba if needed
    ## conda install -n base -c conda-forge mamba
    mamba env create -n test_syn_spark_3_3_1 -f test_syn_spark_3_3_1.yaml
    

    This runs in a few minutes on my machine, most of which is download time.
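
    If you want to confirm whether strict channel priority is contributing to the slow solve, you can inspect and adjust the setting with standard conda config commands:

    ## show the current setting (strict, flexible, or disabled)
    conda config --show channel_priority
    ## strict priority is usually fastest once conda-forge is listed first
    conda config --set channel_priority strict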


    Other Thoughts

    1. I wouldn't impose a fixed version constraint on pip or setuptools unless there is a specific bug you are avoiding. I'd at least loosen them to lower bounds.
    2. Conda Forge is fully self-sufficient these days. I would not only drop defaults but also insulate against any channel mixing with the nodefaults directive.
    3. The defaults channel prefers MKL for BLAS on x64, whereas Conda Forge defaults to OpenBLAS. You may want to declare your preference explicitly (e.g., accelerate on macOS arm64, mkl on Intel); see the selector examples after this list.
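
    For reference, the same build-string selector pattern used for MKL in the YAML below also works for the other variants on Conda Forge; the exact build strings available can vary by platform, so treat these as illustrative:

    ## pick exactly one of these lines for the dependencies list
    # - blas=*=mkl         # MKL (typically fastest on Intel x64)
    # - blas=*=openblas    # OpenBLAS (Conda Forge default)
    # - blas=*=accelerate  # Apple Accelerate (macOS arm64)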

    In summary, this is how I would write the YAML:

    name: test_syn_spark_3_3_1
    channels:
      - conda-forge
      - nodefaults    # insulate from user config
    dependencies:
      ## Python Core
      - python=3.10
      - pip >=23.0
      - setuptools >=65.0
    
      ## BLAS
      ## adjust for hardware/preference
      - blas=*=mkl
    
      ## Conda Python pkgs
      - pandas=1.5
      - pyarrow=11.0.0
      - pyspark=3.3.1
      
      ## PyPI pkgs
      - pip:
        - azure-common==1.1.28
        - azure-core==1.26.1
        - azure-datalake-store==0.0.51
        - azure-identity==1.7.0
        - azure-mgmt-core==1.3.2
        - azure-mgmt-resource==21.2.1
        - azure-mgmt-storage==20.1.0
        - azure-storage-blob==12.16.0
        - azure-mgmt-authorization==2.0.0
        - azure-mgmt-keyvault==10.1.0
        - azure-storage-file-datalake==12.11.0
        - check-wheel-contents==0.4.0
        - pyarrowfs-adlgen2==0.2.4
        - wheel-filename==1.4.1
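
    A minimal way to create and sanity-check the environment from this file (assuming it is saved as test_syn_spark_3_3_1.yaml and Mamba is installed):

    mamba env create -f test_syn_spark_3_3_1.yaml
    conda activate test_syn_spark_3_3_1
    python -c "import pyspark, pyarrow, pandas; print(pyspark.__version__)"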