Tags: databricks, apache-spark-xml, databricks-dbx

How to install spark-xml library using dbx


I am trying to install the library spark-xml_2.12-0.15.0 using dbx.

The documentation I found says to include it in the conf/deployment.yml file like this:

custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "10.4.x-cpu-ml-scala2.12"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 2

build:
  commands:
    - "mvn clean package"

environments:
  default:
    workflows:
      - name: "charming-aurora-sample-jvm"
        libraries:
          - jar: "{{ 'file://' + dbx.get_last_modified_file('target/scala-2.12', 'jar') }}"
        tasks:
          - task_key: "main"
            <<: *basic-static-cluster
            deployment_config:
              no_package: true
            spark_jar_task:
              main_class_name: "org.some.main.ClassName"

You can see the documentation page here: https://dbx.readthedocs.io/en/latest/guides/jvm/jvm_devops/?h=maven

I have installed the library on the cluster using its Maven coordinates (https://mvnrepository.com/artifact/com.databricks/spark-xml_2.13/0.15.0):

<!-- https://mvnrepository.com/artifact/com.databricks/spark-xml -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.13</artifactId>
    <version>0.15.0</version>
</dependency>

I can use it at the notebook level, but not from a job deployed using dbx.
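For context, notebook-level usage looks roughly like this (the row tag and path below are just placeholders):

# `spark` is the SparkSession that Databricks provides in notebooks
df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "record")   # placeholder row tag
    .load("/mnt/raw/sample.xml")  # placeholder path
)
df.printSchema()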

Edit

I am using PySpark.

So I included it like this in conf/deployment.yml:

libraries:
    - maven: "com.databricks:spark-xml_2.12:0.15.0"

In the file conf/deployment.yml:

- name: "my-job"
  libraries:
    - maven: 
      - coordinates:"com.databricks:spark-xml_2.12:0.15.0"
  tasks:
    - task_key: "first_task"
      <<: *basic-static-cluster
      python_wheel_task:
        package_name: "project_name"
        entry_point: "jl" # take a look at the setup.py entry_points section for details on how to define an entrypoint
        parameters: ["--conf-file", "file:fuse://conf/tasks/my_job_config.yml"]

Then I go with

dbx deploy my-job

This throws the following error:

HTTPError: 400 Client Error: Bad Request for url: https://adb-xxxx.azuredatabricks.net/api/2.0/jobs/reset
 Response from server:
 { 'error_code': 'MALFORMED_REQUEST',
  'message': "Could not parse request object: Expected 'START_OBJECT' not "
             "'START_ARRAY'\n"
             ' at [Source: (ByteArrayInputStream); line: 1, column: 91]\n'
             ' at [Source: java.io.ByteArrayInputStream@37fda06f; line: 1, '
             'column: 91]'}


Solution

  • You were pretty close, and the error you've run into doesn't really say much. We plan to introduce structure verification so that such checks become more understandable.

    The correct deployment file structure should look as follows:

    - name: "my-job"
      tasks:
        - task_key: "first_task"
          <<: *basic-static-cluster
          # please note that libraries section is on the task level 
          libraries:
            - maven: 
                coordinates: "com.databricks:spark-xml_2.12:0.15.0"
          python_wheel_task:
            package_name: "project_name"
            entry_point: "jl" # take a look at the setup.py entry_points section for details on how to define an entrypoint
            parameters: ["--conf-file", "file:fuse://conf/tasks/my_job_config.yml"]
    

    Two important points here:

    1. the libraries section is on the task level
    2. the maven section expects an object, not a list, so this will not work:
    #THIS IS INCORRECT DON'T DO THIS
    libraries:
      - maven: 
          - coordinates: "com.databricks:spark-xml_2.12:0.15.0"
    

    But this will:

    # correct structure
    libraries:
      - maven: 
          coordinates: "com.databricks:spark-xml_2.12:0.15.0"
    

    I've summarized these details in this new documentation section.
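
    With the library attached at the task level, the wheel's entry point can then use spark-xml as usual. A minimal sketch (the row tag, path, and target table below are placeholders):

    from pyspark.sql import SparkSession

    def jl():  # the entry point referenced in the deployment file; body is a placeholder sketch
        spark = SparkSession.builder.getOrCreate()
        df = (
            spark.read.format("com.databricks.spark.xml")
            .option("rowTag", "record")        # placeholder row tag
            .load("dbfs:/mnt/raw/sample.xml")  # placeholder path
        )
        df.write.mode("overwrite").saveAsTable("target_schema.target_table")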