amazon-web-services, apache-spark, hadoop, amazon-s3, hdfs

Spark - can "spark.deploy.spreadOut = false" give a performance benefit on S3?


I understand that setting "spark.deploy.spreadOut" to true can benefit HDFS, but for S3, can setting it to false have a benefit over true?


Solution

  • If you're running Hadoop with HDFS, there is no benefit in using the Spark Standalone scheduler, which is the scheduler that property applies to. You should be running YARN instead, where the ResourceManager determines how executors are spread.

    If you are running the Standalone scheduler on EC2, then setting that property can make a difference, and its default is true.

    In other words, where you read the data from is not the deciding factor here; the deploy mode of the master is.

    Bigger performance gains are more likely to come from the number of files you're trying to read and the format you store the data in.
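
To make the difference between the two settings concrete, here is a minimal Python sketch of how a standalone master might hand out cores to one application. It is a simplified illustration, not Spark's actual scheduler code: the function name `allocate_cores` and its parameters are hypothetical, and it ignores memory limits, executor sizes, and worker ordering.

```python
def allocate_cores(free_cores, cores_wanted, spread_out):
    """Return how many cores each worker contributes to one application.

    free_cores   -- list of free cores per worker (hypothetical input)
    cores_wanted -- total cores the application asks for
    spread_out   -- True: round-robin across workers;
                    False: fill one worker before moving to the next
    """
    assigned = [0] * len(free_cores)
    # Never try to hand out more cores than the cluster has free.
    remaining = min(cores_wanted, sum(free_cores))

    if spread_out:
        # One core at a time, cycling through the workers.
        i = 0
        while remaining > 0:
            if assigned[i] < free_cores[i]:
                assigned[i] += 1
                remaining -= 1
            i = (i + 1) % len(free_cores)
    else:
        # Consolidate: take as much as possible from each worker in turn.
        for i, free in enumerate(free_cores):
            take = min(free, remaining)
            assigned[i] = take
            remaining -= take

    return assigned


if __name__ == "__main__":
    workers = [4, 4, 4]  # three workers with four free cores each
    print("spreadOut=true :", allocate_cores(workers, 6, True))
    print("spreadOut=false:", allocate_cores(workers, 6, False))
```

With S3 there are no local data blocks on the workers, so the data-locality argument for spreading out does not apply in the way it does for HDFS; what changes is only how many nodes the application's executors land on. On a real standalone cluster this is a master-side property, typically passed through `SPARK_MASTER_OPTS` as `-Dspark.deploy.spreadOut=false`.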