google-cloud-dataproc, dataproc

In Dataproc, should the file prefix be used when applying a property to a job?


The documentation explicitly states:

When applying a property to a job, the file prefix is not used.

However, the example given there is inconsistent with this statement.

This is what the page says:

...However, many of these properties can also be applied to specific jobs. When applying a property to a job, the file prefix is not used. The following example sets Spark executor memory to 4g for a Spark job (spark: prefix omitted).

gcloud dataproc jobs submit spark \
    --region=region \
    --properties=spark.executor.memory=4g \
    ... other args ...

Job properties can be submitted in a file using the gcloud dataproc jobs submit job-type --properties-file flag (see, for example, the --properties-file description for an Hadoop job submission).

gcloud dataproc jobs submit JOB_TYPE \
    --region=region \
    --properties-file=PROPERTIES_FILE \
    ... other args ...

The PROPERTIES_FILE is a set of line-delimited key=value pairs. The property to be set is the key, and the value to set the property to is the value. See the java.util.Properties class for a detailed description of the properties file format.

The following is an example of a properties file that can be passed to the --properties-file flag when submitting a Dataproc job.

dataproc:conda.env.config.uri=gs://some-bucket/environment.yaml
spark:spark.history.fs.logDirectory=gs://some-bucket
spark:spark.eventLog.dir=gs://some-bucket
capacity-scheduler:yarn.scheduler.capacity.root.adhoc.capacity=5

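In the properties file example above, file prefixes (dataproc:, spark:, capacity-scheduler:) are used in job properties, which contradicts the earlier statement.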


Solution

  • No, property prefixes are only used for cluster properties and do not apply to job properties: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties
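
Following that answer, a job-level properties file would carry bare keys with no file prefix. Below is a minimal sketch, assuming a hypothetical file name (job.properties) and reusing the placeholder bucket from the question. The submit command is left as a comment because it needs a real project and cluster; its elided arguments are kept as in the quoted documentation.

```shell
# Write a job-level properties file (hypothetical name: job.properties).
# Keys are bare Spark properties; the "spark:" file prefix is omitted.
cat > job.properties <<'EOF'
spark.executor.memory=4g
spark.eventLog.dir=gs://some-bucket
EOF

# Sanity check: count lines whose key still starts with a "prefix:" such as
# "spark:" or "capacity-scheduler:". grep -c prints 0 when nothing matches.
prefixed=$(grep -cE '^[A-Za-z-]+:' job.properties || true)
echo "prefixed keys: $prefixed"

# Submission, as quoted from the documentation (not executed here):
# gcloud dataproc jobs submit spark \
#     --region=region \
#     --properties-file=job.properties \
#     ... other args ...
```

If the count is not zero, the file still contains cluster-style keys, and the prefix should be dropped from them before the file is passed to a job.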