Tags: apache-spark, pyspark, databricks, pydeequ

Error importing PyDeequ package on databricks


I want to run some data-quality tests, and for that I intend to use PyDeequ in a Databricks notebook. Keep in mind that I'm very new to Databricks and Spark.

First, I created a cluster with the runtime version "10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)" and added the environment variable SPARK_VERSION=3.2, as suggested in the project's GitHub repository.

Since the available PyPI package is not up to date, I tried installing the package as a notebook-scoped library with the following commands:

    %pip install numpy==1.22
    %pip install git+https://github.com/awslabs/python-deequ.git

(The first line is only there to prevent a conflict between numpy versions.)

Then, when running import pydeequ, I get:

IndexError                                Traceback (most recent call last)
<command-3386600260354339> in <module>
----> 1 import pydeequ

/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
    165             # Import the desired module. If you’re seeing this while debugging a failed import,
    166             # look at preceding stack frames for relevant error information.
--> 167             original_result = python_builtin_import(name, globals, locals, fromlist, level)
    168 
    169             is_root_import = thread_local._nest_level == 1

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/__init__.py in <module>
     19 from pydeequ.analyzers import AnalysisRunner
     20 from pydeequ.checks import Check, CheckLevel
---> 21 from pydeequ.configs import DEEQU_MAVEN_COORD
     22 from pydeequ.profiles import ColumnProfilerRunner
     23 

/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
    165             # Import the desired module. If you’re seeing this while debugging a failed import,
    166             # look at preceding stack frames for relevant error information.
--> 167             original_result = python_builtin_import(name, globals, locals, fromlist, level)
    168 
    169             is_root_import = thread_local._nest_level == 1

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in <module>
     35 
     36 
---> 37 DEEQU_MAVEN_COORD = _get_deequ_maven_config()
     38 IS_DEEQU_V1 = re.search("com\.amazon\.deequ\:deequ\:1.*", DEEQU_MAVEN_COORD) is not None

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_deequ_maven_config()
     26 
     27 def _get_deequ_maven_config():
---> 28     spark_version = _get_spark_version()
     29     try:
     30         return SPARK_TO_DEEQU_COORD_MAPPING[spark_version[:3]]

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_spark_version()
     21     ]
     22     output = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
---> 23     spark_version = output.stdout.decode().split("\n")[-2]
     24     return spark_version
     25 

IndexError: list index out of range
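For what it's worth, the traceback points at pydeequ's version detection: _get_spark_version shells out to spark-submit --version and parses the subprocess's stdout. A plausible failure mode (an assumption on my part, not verified on this cluster) is that the command produced no usable stdout, e.g. because the version banner went to stderr, so the parsing step indexes into an empty result:

```python
# Minimal repro of the failure mode, assuming the spark-submit subprocess
# returned empty stdout (as can happen when the version banner is written
# to stderr). Splitting an empty string on "\n" yields a one-element
# list, so indexing [-2] raises IndexError, matching the traceback.
captured_stdout = b""                        # what the subprocess returned
lines = captured_stdout.decode().split("\n")
print(lines)                                 # ['']
try:
    spark_version = lines[-2]                # pydeequ's parsing step
except IndexError:
    print("IndexError: list index out of range")
```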

Any idea what causes this, or an alternative way to install the library without going through PyPI?


Solution

  • I assumed I wouldn't need to install the Deequ library itself. Apparently, all I had to do was add it to the cluster via its Maven coordinates, and that solved the problem.
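For reference, outside Databricks the equivalent fix can be sketched by attaching the Deequ jar through spark.jars.packages when building the session. The coordinate below (com.amazon.deequ:deequ:2.0.1-spark-3.2) is an assumption chosen to match Spark 3.2 / Scala 2.12; verify it against the compatibility table in the PyDeequ README. On Databricks, install the same coordinate as a cluster library instead (Libraries > Install new > Maven).

```python
import os

# pydeequ reads SPARK_VERSION to pick the matching Deequ Maven coordinate
os.environ["SPARK_VERSION"] = "3.2"

from pyspark.sql import SparkSession

# Attach the Deequ jar via Maven coordinates; the version string is an
# assumption for Spark 3.2 / Scala 2.12 -- adjust it to your runtime.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "com.amazon.deequ:deequ:2.0.1-spark-3.2")
    .getOrCreate()
)
```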