python, google-cloud-dataproc

Cannot create cluster with properties using the dataproc API


I'm trying to create a cluster programmatically in Python:

import googleapiclient.discovery

dataproc = googleapiclient.discovery.build('dataproc', 'v1')
zone_uri = 'https://www.googleapis.com/compute/v1/projects/{project_id}/zones/{zone}'.format(
  project_id=my_project_id,
  zone=my_zone,
  )
cluster_data = {
  'projectId': my_project_id,
  'clusterName': my_cluster_name,
  'config': {
    'gceClusterConfig': {
      'zoneUri': zone_uri
    },
    'softwareConfig' : {
      'properties' : {'string' : {'spark:spark.executor.memory' : '10gb'}},
    },
  },
}
result = dataproc \
  .projects() \
  .regions() \
  .clusters() \
  .create(
    projectId=my_project_id,
    region=my_region,
    body=cluster_data,
    ) \
  .execute()

And I keep getting this error: Invalid JSON payload received. Unknown name "spark:spark.executor.memory" at 'cluster.config.software_config.properties[0].value': Cannot find field.

The API documentation is here: https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#SoftwareConfig

Property keys are specified in prefix:property format, such as core:fs.defaultFS.

And even when I change the properties to {'string' : {'core:fs.defaultFS' : 'hdfs://'}}, I get the same error.


Solution

  • Properties is a key/value mapping:

    'properties': {
      'spark:spark.executor.memory': 'foo'
    }
    

    The documentation could have had a better example. In general, the best way to find out what the API looks like is to click "Equivalent REST" in the Cloud Console, or --log-http when using gcloud. For example:

    $ gcloud dataproc clusters create clustername --properties spark:spark.executor.memory=foo --log-http
    =======================
    ==== request start ====
    uri: https://dataproc.googleapis.com/v1/projects/projectid/regions/global/clusters?alt=json
    method: POST
    == body start ==
    {"clusterName": "clustername", "config": {"gceClusterConfig": {"internalIpOnly": false, "zoneUri": "us-east1-d"}, "masterConfig": {"diskConfig": {}}, "softwareConfig": {"properties": {"spark:spark.executor.memory": "foo"}}, "workerConfig": {"diskConfig": {}}}, "projectId": "projectid"}
    == body end ==
    ==== request end ====
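Applied to the Python snippet from the question, the fix is to drop the extra 'string' wrapper so that 'properties' maps "prefix:key" strings directly to value strings, matching the gcloud-generated payload above. A minimal sketch of the corrected request body (the project, zone, and cluster names are placeholders; '10g' is used as the memory value since Spark expects a size suffix like 'g'):

```python
# Placeholder values -- substitute your own.
my_project_id = 'my-project'
my_zone = 'us-east1-d'
my_cluster_name = 'my-cluster'

zone_uri = 'https://www.googleapis.com/compute/v1/projects/{project_id}/zones/{zone}'.format(
    project_id=my_project_id,
    zone=my_zone,
)

cluster_data = {
    'projectId': my_project_id,
    'clusterName': my_cluster_name,
    'config': {
        'gceClusterConfig': {
            'zoneUri': zone_uri,
        },
        'softwareConfig': {
            # 'properties' is a flat key/value mapping: no nested
            # {'string': {...}} wrapper around the property dict.
            'properties': {'spark:spark.executor.memory': '10g'},
        },
    },
}
```

Passing this body to the same dataproc.projects().regions().clusters().create(...) call from the question should no longer trigger the "Cannot find field" error.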