I am using dataproc to submit jobs on spark. However on spark-submit, non-spark arguments are being read as spark arguments!
I am receiving the error/warning below when running a particular job.
Warning: Ignoring non-spark config property: dataproc:dataproc.conscrypt.provider.enable=false
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region us-east1 \
--properties dataproc:dataproc.conscrypt.provider.enable=false,spark.executor.extraJavaOptions=$SPARK_CONF,spark.executor.memory=${MEMORY}G,spark.executor.cores=$total_cores \
--class com.sample.run \
--jars gs://jars/jobs.jar \
-- 1000
I would like to know whats wrong with my current format. Thanks in advance.
spark-submit
just silently ignored conf options that did not start with spark.
thats the reason for this property it was saying it was ignored.
--properties dataproc:dataproc.conscrypt.provider.enable=false
any property you should pass as spark.
propertyname
this is just warning.
Why this property required :
The Conscrypt security provider has been temporarily changed from the default to an optional security provider. This change was made due to incompatibilities with some workloads. The Conscrypt provider will be re-enabled as the default with the release of Cloud Dataproc 1.2 in the future. In the meantime, you can re-enable the Conscrypt provider when creating a cluster by specifying this Cloud Dataproc property:
--properties
dataproc:dataproc.conscrypt.provider.enable=true
This has to be specified when creating cluster since this is cluster property, not property of spark. (means spark framework cant able to understand this and simply ignored.)
Example usage :
gcloud beta dataproc clusters create my-test
--project my-project
--subnet prod-sub-1
--zone southamerica-east1-a
--region=southamerica-east1
--master-machine-type n1-standard-4
--master-boot-disk-size 40
--num-workers 5
--worker-machine-type n1-standard-4
--worker-boot-disk-size 20
--image-version 1.2
--tags internal,ssh,http-server,https-server
--properties dataproc:dataproc.conscrypt.provider.enable=false
--format=json
--max-idle=10m
and then start job like this...
gcloud dataproc jobs submit pyspark gs://path-to-script/spark_full_job.py
--cluster=my-test
--project=my-project
--region=southamerica-east1
--jars=gs://path-to-driver/mssql-jdbc-6.4.0.jre8.jar
--format=json -- [JOB_ARGS]