I want to run some data quality tests, and for that I intend to use PyDeequ in a Databricks notebook. Keep in mind that I'm very new to Databricks and Spark.
First, I created a cluster with the Runtime version "10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)" and set the environment variable SPARK_VERSION=3.2, as described in the project's GitHub repository.
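As a sanity check, the variable can be read back from a notebook cell. This is only a minimal sketch: the helper name `describe_spark_version` is made up for illustration, and it assumes the cluster-level variable is exposed to the notebook's Python process through `os.environ`.

```python
import os


def describe_spark_version(env=None):
    """Return a short message about SPARK_VERSION (hypothetical helper)."""
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    value = env.get("SPARK_VERSION")
    if value is None:
        return "SPARK_VERSION is not set"
    return f"SPARK_VERSION={value}"


print(describe_spark_version())
```

If this reports that the variable is not set, the cluster configuration is probably worth re-checking before looking at the package itself.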
Since the available PyPI package is not up to date, I tried installing the package as a notebook-scoped library with the following commands:
%pip install numpy==1.22
%pip install git+https://github.com/awslabs/python-deequ.git
(The first line is only there to prevent a conflict between numpy versions.)
Then, when doing
import pydeequ
I get
IndexError Traceback (most recent call last)
<command-3386600260354339> in <module>
----> 1 import pydeequ
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
165 # Import the desired module. If you’re seeing this while debugging a failed import,
166 # look at preceding stack frames for relevant error information.
--> 167 original_result = python_builtin_import(name, globals, locals, fromlist, level)
168
169 is_root_import = thread_local._nest_level == 1
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/__init__.py in <module>
19 from pydeequ.analyzers import AnalysisRunner
20 from pydeequ.checks import Check, CheckLevel
---> 21 from pydeequ.configs import DEEQU_MAVEN_COORD
22 from pydeequ.profiles import ColumnProfilerRunner
23
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
165 # Import the desired module. If you’re seeing this while debugging a failed import,
166 # look at preceding stack frames for relevant error information.
--> 167 original_result = python_builtin_import(name, globals, locals, fromlist, level)
168
169 is_root_import = thread_local._nest_level == 1
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in <module>
35
36
---> 37 DEEQU_MAVEN_COORD = _get_deequ_maven_config()
38 IS_DEEQU_V1 = re.search("com\.amazon\.deequ\:deequ\:1.*", DEEQU_MAVEN_COORD) is not None
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_deequ_maven_config()
26
27 def _get_deequ_maven_config():
---> 28 spark_version = _get_spark_version()
29 try:
30 return SPARK_TO_DEEQU_COORD_MAPPING[spark_version[:3]]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_spark_version()
21 ]
22 output = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
---> 23 spark_version = output.stdout.decode().split("\n")[-2]
24 return spark_version
25
IndexError: list index out of range
Any idea what causes this, or is there an alternative way to get the library without using PyPI?
I had assumed that installing PyDeequ was enough and that I wouldn't need to add the Deequ library itself. It turns out that all I had to do was add Deequ to the cluster via its Maven coordinates, and that solved the problem.
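As for the IndexError itself: the failing line in the traceback (line 23 of configs.py) takes the second-to-last line of the subprocess output. Below is a minimal sketch of how that indexing behaves, assuming the subprocess printed nothing to stdout. I have not confirmed that this is what actually happened on the cluster, and `second_to_last_line` is a made-up name that only mirrors the indexing, not the real PyDeequ function.

```python
def second_to_last_line(stdout: bytes) -> str:
    # Same indexing as the failing line shown in the traceback.
    return stdout.decode().split("\n")[-2]


# Output ending in a newline: split gives ["3.2.1", ""], so [-2] is the version.
print(second_to_last_line(b"3.2.1\n"))

# Empty output: split gives [""], a single element, so [-2] is out of range.
try:
    second_to_last_line(b"")
except IndexError as exc:
    print("IndexError:", exc)
```

So whenever the version lookup produces no output at all, the import fails with "list index out of range" instead of a more descriptive message.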