I have 7 datanodes and 1 namenode. Each node has 32 GB of memory and 20 cores, so I set the container memory to 30 GB and the container virtual CPU cores to 18.
However, only three of the datanodes do any work; the other four stay idle.
Here is my spark-submit command:
/opt/spark/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--driver-cores 18 \
--executor-memory 8g \
--executor-cores 18 \
--num-executors 7 \
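As a rough sanity check on these numbers, the following is a minimal sketch of how many executors of this size fit into one node's container budget. It assumes Spark's documented default memory overhead on YARN (the larger of 384 MiB and 10% of the executor memory) and ignores YARN's rounding to its minimum allocation size. The class and method names are made up for illustration. If the YARN scheduler only accounts for memory (which is an assumption about this cluster, not something shown here), several executors can be packed onto the same node.

```java
// Back-of-the-envelope sizing sketch; not part of the actual job.
final class ContainerFit {
    // Assumed default: overhead = max(384 MiB, 10% of executor memory).
    static long containerMb(long executorMemoryMb) {
        long overheadMb = Math.max(384L, executorMemoryMb / 10);
        return executorMemoryMb + overheadMb;
    }

    // How many executor containers fit on a node when only memory is counted.
    static long fitByMemory(long nodeMemoryMb, long executorMemoryMb) {
        return nodeMemoryMb / containerMb(executorMemoryMb);
    }

    // How many executor containers fit on a node when only vcores are counted.
    static long fitByCores(int nodeVcores, int executorCores) {
        return nodeVcores / executorCores;
    }

    public static void main(String[] args) {
        long nodeMb = 30L * 1024; // 30 GB container memory per node
        long execMb = 8L * 1024;  // --executor-memory 8g
        System.out.println("container size (MB): " + containerMb(execMb));
        System.out.println("executors per node by memory: " + fitByMemory(nodeMb, execMb));
        System.out.println("executors per node by vcores: " + fitByCores(18, 18));
    }
}
```

Under these assumptions each executor needs about 9011 MB, so three fit per node by memory but only one fits per node by vcores.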
Java code:
SQLContext sqlc = new SQLContext(spark);
Dataset<Row> df = sqlc.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.load(traFile);
// repartition() returns a new Dataset; reassign it, otherwise the call has no effect.
df = df.repartition(PartitionSize); // PartitionSize = 7
df.persist(StorageLevel.MEMORY_ONLY());
This is my data information:
I also tried the command below:
sudo -u hdfs hdfs balancer
However, I was able to solve this problem by adding the following option to my script:
--conf "spark.locality.wait.node=0"
Here is my new script:
/opt/spark/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--driver-cores $drivercores \
--executor-memory 8g \
--executor-cores $execores \
--num-executors $exes \
--conf "spark.locality.wait.node=0" \
Thanks to this script, all nodes now work.
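For context, spark.locality.wait.node controls how long Spark waits for a free slot on a node that holds a task's data before it falls back to a less local node; by default it inherits spark.locality.wait (3s). The sketch below is a deliberately simplified toy model of that idea, not Spark's scheduler. The class name, the slot counts and the data layout (all input blocks on three nodes) are assumptions made up for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model only: this is NOT Spark's scheduler, just an illustration of the idea.
final class LocalityWaitSketch {
    /**
     * Assigns one wave of tasks and returns the nodes that received work.
     * preferredNode[i] is the node assumed to hold the data for task i.
     */
    static Set<Integer> busyNodes(int numNodes, int slotsPerNode,
                                  int[] preferredNode, boolean waitForLocalSlot) {
        int[] used = new int[numNodes];
        Set<Integer> busy = new TreeSet<>();
        for (int pref : preferredNode) {
            int target = pref; // by default the task queues on its preferred node
            if (!waitForLocalSlot && used[pref] >= slotsPerNode) {
                // No waiting: take the first node that still has a free slot.
                for (int n = 0; n < numNodes; n++) {
                    if (used[n] < slotsPerNode) {
                        target = n;
                        break;
                    }
                }
            }
            used[target]++; // may exceed slotsPerNode, which models queued tasks
            busy.add(target);
        }
        return busy;
    }

    public static void main(String[] args) {
        // 21 tasks whose input blocks live on nodes 0..2 only (assumed layout).
        int[] pref = new int[21];
        for (int i = 0; i < pref.length; i++) {
            pref[i] = i % 3;
        }
        System.out.println("waiting for local slots: " + busyNodes(7, 3, pref, true));
        System.out.println("no locality wait:        " + busyNodes(7, 3, pref, false));
    }
}
```

In this toy setup, waiting for local slots keeps all work on the three nodes that hold the data, while dropping the wait spreads the tasks across all seven nodes. Real Spark does fall back after the wait expires, so the effect in practice depends on how long tasks run relative to the wait.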