I have the following environment.yml file. It is taking 1.5 hours to create this environment. How can I improve (or debug) the creation time?
name: test_syn_spark_3_3_1
channels:
- defaults
- conda-forge
dependencies:
- python=3.10
- pandas=1.5
- pip=23.0
- pyarrow=11.0.0
- pyspark=3.3.1
- setuptools=65.0
- pip:
- azure-common==1.1.28
- azure-core==1.26.1
- azure-datalake-store==0.0.51
- azure-identity==1.7.0
- azure-mgmt-core==1.3.2
- azure-mgmt-resource==21.2.1
- azure-mgmt-storage==20.1.0
- azure-storage-blob==12.16.0
- azure-mgmt-authorization==2.0.0
- azure-mgmt-keyvault==10.1.0
- azure-storage-file-datalake==12.11.0
- check-wheel-contents==0.4.0
- pyarrowfs-adlgen2==0.2.4
- wheel-filename==1.4.1
Switch the channel order and use Mamba. Specifically, pyspark=3.3.1 is only available from Conda Forge, so the conda-forge channel should go first to avoid packages being masked under channel_priority: strict. Mamba is also faster, gives clearer error reporting, and its maintainers are very responsive.
test_syn_spark_3_3_1.yaml
name: test_syn_spark_3_3_1
channels:
- conda-forge
- defaults
# rest the same...
Create with Mamba (or micromamba):
## install mamba if needed
## conda install -n base -c conda-forge mamba
mamba env create -n test_syn_spark_3_3_1 -f test_syn_spark_3_3_1.yaml
On my machine this completes in a few minutes, most of which is download time.
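If installing Mamba into base isn't an option, a standalone micromamba can consume the same file; and if you want to see where classic conda spends its time, add verbosity. A quick sketch (check the flags against your installed versions):

## standalone micromamba, no base environment required
micromamba create -n test_syn_spark_3_3_1 -f test_syn_spark_3_3_1.yaml
## time the classic conda solve with verbose output
time conda env create -f test_syn_spark_3_3_1.yaml -v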
A few more notes:

- I wouldn't pin pip or setuptools unless there is a specific bug you are avoiding. I'd probably at least loosen them to lower bounds.
- I never use defaults, and even insulate against any channel mixing with the nodefaults directive.
- The defaults channel prefers MKL for BLAS on x64, whereas Conda Forge defaults to OpenBLAS. So, you may want to explicitly declare your preference (e.g., accelerate on macOS arm64, mkl on Intel); alternative specs are sketched after the YAML below.

In summary, this is how I would write the YAML:
name: test_syn_spark_3_3_1
channels:
- conda-forge
- nodefaults # insulate from user config
dependencies:
## Python Core
- python=3.10
- pip >=23.0
- setuptools >=65.0
## BLAS
## adjust for hardware/preference
- blas=*=mkl
## Conda Python pkgs
- pandas=1.5
- pyarrow=11.0.0
- pyspark=3.3.1
## PyPI pkgs
- pip:
- azure-common==1.1.28
- azure-core==1.26.1
- azure-datalake-store==0.0.51
- azure-identity==1.7.0
- azure-mgmt-core==1.3.2
- azure-mgmt-resource==21.2.1
- azure-mgmt-storage==20.1.0
- azure-storage-blob==12.16.0
- azure-mgmt-authorization==2.0.0
- azure-mgmt-keyvault==10.1.0
- azure-storage-file-datalake==12.11.0
- check-wheel-contents==0.4.0
- pyarrowfs-adlgen2==0.2.4
- wheel-filename==1.4.1
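The blas=*=mkl line above is just one choice; the same build-string selector picks other implementations of the conda-forge blas metapackage. For example (adjust to your hardware):

## OpenBLAS (the conda-forge default on most platforms)
- blas=*=openblas
## Apple's Accelerate on macOS arm64
- blas=*=accelerate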