I am trying to install the library spark-xml_2.12-0.15.0 using dbx.
The documentation I found says to include it in the conf/deployment.yml file like this:
custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "10.4.x-cpu-ml-scala2.12"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 2

build:
  commands:
    - "mvn clean package" #

environments:
  default:
    workflows:
      - name: "charming-aurora-sample-jvm"
        libraries:
          - jar: "{{ 'file://' + dbx.get_last_modified_file('target/scala-2.12', 'jar') }}" #
        tasks:
          - task_key: "main"
            <<: *basic-static-cluster
            deployment_config: #
              no_package: true
            spark_jar_task:
              main_class_name: "org.some.main.ClassName"
You can see the documentation page here: https://dbx.readthedocs.io/en/latest/guides/jvm/jvm_devops/?h=maven
I have installed the library on the cluster via Maven (https://mvnrepository.com/artifact/com.databricks/spark-xml_2.13/0.15.0):
<!-- https://mvnrepository.com/artifact/com.databricks/spark-xml -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.13</artifactId>
    <version>0.15.0</version>
</dependency>
I can use it at the notebook level, but not from a job deployed using dbx. I am using PySpark.
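For reference, this is roughly how I call it from a notebook (a minimal sketch; the rowTag value and file path are placeholders, not my real ones):

# Minimal sketch of the notebook usage; "book" and the path are placeholders.
# `spark` is the SparkSession that Databricks notebooks provide automatically.
df = (
    spark.read.format("xml")        # short name registered by spark-xml
    .option("rowTag", "book")
    .load("/mnt/raw/sample.xml")
)
df.printSchema()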
So I included it like this in conf/deployment.yml:
libraries:
  - maven: "com.databricks:spark-xml_2.12:0.15.0"
In the file conf/deployment.yml:
- name: "my-job"
  libraries:
    - maven:
        - coordinates: "com.databricks:spark-xml_2.12:0.15.0"
  tasks:
    - task_key: "first_task"
      <<: *basic-static-cluster
      python_wheel_task:
        package_name: "project_name"
        entry_point: "jl" # take a look at the setup.py entry_points section for details on how to define an entrypoint
        parameters: ["--conf-file", "file:fuse://conf/tasks/my_job_config.yml"]
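For context, the "jl" entry point is declared in setup.py roughly like this (the module path below is illustrative, not my actual one):

# setup.py (sketch): how the "jl" console-script entry point is wired up.
# "project_name.tasks.sample_task:entrypoint" is an illustrative target.
from setuptools import find_packages, setup

setup(
    name="project_name",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            "jl = project_name.tasks.sample_task:entrypoint",
        ]
    },
)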
Then I run:
dbx deploy my-job
This throws the following error:
HTTPError: 400 Client Error: Bad Request for url: https://adb-xxxx.azuredatabricks.net/api/2.0/jobs/reset
Response from server:
{ 'error_code': 'MALFORMED_REQUEST',
  'message': "Could not parse request object: Expected 'START_OBJECT' not "
             "'START_ARRAY'\n"
             ' at [Source: (ByteArrayInputStream); line: 1, column: 91]\n'
             ' at [Source: java.io.ByteArrayInputStream@37fda06f; line: 1, '
             'column: 91]'}
You were pretty close, and the error you've run into doesn't really say much. We plan to introduce structure verification so that such checks are more understandable.
The correct deployment file structure should look as follows:
- name: "my-job"
  tasks:
    - task_key: "first_task"
      <<: *basic-static-cluster
      # please note that the libraries section is on the task level
      libraries:
        - maven:
            coordinates: "com.databricks:spark-xml_2.12:0.15.0"
      python_wheel_task:
        package_name: "project_name"
        entry_point: "jl" # take a look at the setup.py entry_points section for details on how to define an entrypoint
        parameters: ["--conf-file", "file:fuse://conf/tasks/my_job_config.yml"]
Two important points here:

1. The libraries section is on the task level.
2. The maven section expects an object, not a list, therefore this will not work:

# THIS IS INCORRECT, DON'T DO THIS
libraries:
  - maven:
      - coordinates: "com.databricks:spark-xml_2.12:0.15.0"
But this will:

# correct structure
libraries:
  - maven:
      coordinates: "com.databricks:spark-xml_2.12:0.15.0"
I've summarized these details in this new documentation section.
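For completeness, here is a sketch of how that workflow fragment nests into the full deployment file, reusing the cluster anchors and names from your own snippets:

# Sketch of the full nesting; anchors come from the custom section defined earlier.
environments:
  default:
    workflows:
      - name: "my-job"
        tasks:
          - task_key: "first_task"
            <<: *basic-static-cluster
            libraries:
              - maven:
                  coordinates: "com.databricks:spark-xml_2.12:0.15.0"
            python_wheel_task:
              package_name: "project_name"
              entry_point: "jl"
              parameters: ["--conf-file", "file:fuse://conf/tasks/my_job_config.yml"]

After updating the file, dbx deploy my-job should go through, and you can then trigger the workflow with dbx launch my-job (the exact launch syntax depends on your dbx version).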