Tags: python, apache-spark, hadoop, logic

Segregate Spark and Hadoop configuration properties


I have a use case where I want to segregate the Spark config properties from the Hadoop config properties in a spark-submit command.

Example spark-submit command:

/usr/lib/spark/bin/spark-submit \
  --master yarn \
  --class com.benchmark.platform.TPCDSBenchmark \
  --deploy-mode cluster \
  --conf spark.executor.instances=5 \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=5 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=10240M \
  --conf spark.driver.memory=8192M \
  --conf spark.hadoop.hive.metastore.uris=thrift://METASTORE_URI:10016 \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider \
  --conf spark.hadoop.fs.s3a.assumed.role.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
  --conf spark.hadoop.fs.s3a.assumed.role.arn=arn:aws:iam::ACCOUNT:ROLE \
  s3://JAR_PATH.jar --iterations=2 --queryFilter=q1-v2.4

I want to extract spark_conf and hadoop_conf from the above command.

Sample output:

{
  "spark_conf": {
    "spark.driver.memory": "8192M",
    "spark.executor.cores": "4",
    "spark.executor.memory": "10240M",
    "spark.executor.instances": "5",
    "spark.dynamicAllocation.maxExecutors": "5",
    "spark.dynamicAllocation.minExecutors": "2"
  },
  "hadoop_conf": {
    "spark.hadoop.hive.metastore.uris": "thrift://METASTORE_URI:10016",
    "spark.hadoop.fs.s3a.assumed.role.arn": "arn:aws:iam::ACCOUNT:ROLE",
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
    "spark.hadoop.fs.s3a.assumed.role.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
  }
}

The comprehensive list of Hadoop-related config properties is available here: list1 list2 list3 list4. The remaining config properties can be assigned to Spark. I don't want to store these hundreds of properties in a database and search for a match. Is there a better way to segregate the two types of config properties?
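
For illustration, the split above can be produced with a plain prefix check on the --conf pairs. This is only a rough Python sketch, not necessarily the answer I'm after; the helper name split_confs and the shortened command are mine:

import json
import shlex

def split_confs(cmd):
    """Split the --conf key=value pairs of a spark-submit command into
    Spark-side and Hadoop-side settings, keyed on the spark.hadoop. prefix."""
    tokens = shlex.split(cmd)
    spark_conf, hadoop_conf = {}, {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag != "--conf":
            continue
        key, _, val = value.partition("=")
        # Spark forwards spark.hadoop.* properties to the Hadoop Configuration,
        # so the prefix itself is the discriminator here.
        target = hadoop_conf if key.startswith("spark.hadoop.") else spark_conf
        target[key] = val
    return spark_conf, hadoop_conf

cmd = ("/usr/lib/spark/bin/spark-submit --master yarn "
       "--conf spark.executor.cores=4 "
       "--conf spark.hadoop.fs.s3a.assumed.role.arn=arn:aws:iam::ACCOUNT:ROLE "
       "s3://JAR_PATH.jar")
spark_conf, hadoop_conf = split_confs(cmd)
print(json.dumps({"spark_conf": spark_conf, "hadoop_conf": hadoop_conf}, indent=2))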


Solution

  • Hadoop code reads core-site.xml (a Hadoop XML file on the classpath) before Spark starts overriding things (a short PySpark check after these notes shows how spark.hadoop.* settings land in that same Hadoop configuration). On managed clusters, all site settings go in there.

    Spark always reads conf/spark-defaults.conf from the classpath (see the docs).

    If you are tuning s3a connections, know that the normative list of settings (and their defaults, unless overridden in core-default.xml) is in Constants.java.

    Documentation is in the hadoop-aws module. Some settings aren't in the docs, mostly by accident. Look at the ones related to thread pool size in each executor, and ask YARN for a few more cores if you want maximum S3 I/O performance.

    Finally, that WebIdentity credential provider reads the environment variable AWS_WEB_IDENTITY_TOKEN_FILE, which must point to a file on the local filesystem. It is not passed around with jobs, so you'll need a way to get it across the cluster. But I guess you've already noticed that.
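
On YARN, one way to at least propagate that environment variable to the application master and the executors is Spark's own env-forwarding properties. This is a sketch under the assumption that the token file already exists at the same path on every node (the path below is made up); Spark only forwards the variable, not the file itself:

from pyspark.sql import SparkSession

# Hypothetical path; the token file must already be present there on every node
# (pre-mounted or distributed out-of-band).
TOKEN_PATH = "/var/run/secrets/aws/token"

spark = (SparkSession.builder
         # Environment for the YARN application master (the driver in cluster mode).
         .config("spark.yarn.appMasterEnv.AWS_WEB_IDENTITY_TOKEN_FILE", TOKEN_PATH)
         # Environment for every executor process.
         .config("spark.executorEnv.AWS_WEB_IDENTITY_TOKEN_FILE", TOKEN_PATH)
         .getOrCreate())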
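
Following up on the first point about Spark overriding the Hadoop settings, a quick way to confirm the spark.hadoop.* forwarding is the PySpark check below. It goes through the internal _jsc handle, so treat it as a diagnostic sketch rather than a stable API; the local master and app name are just for the demo:

from pyspark.sql import SparkSession

# Properties prefixed with spark.hadoop. are copied, minus the prefix, into the
# Hadoop Configuration that Spark builds on top of core-site.xml.
spark = (SparkSession.builder
         .master("local[1]")
         .appName("spark-hadoop-prefix-check")
         .config("spark.hadoop.hive.metastore.uris", "thrift://METASTORE_URI:10016")
         .getOrCreate())

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()  # internal handle
print(hadoop_conf.get("hive.metastore.uris"))  # -> thrift://METASTORE_URI:10016
spark.stop()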