Tags: java, hadoop, apache-spark, hadoop-yarn

Why do only a few nodes work in Apache Spark on YARN?


I have 7 datanodes and 1 namenode. Every node has 32 GB of memory and 20 cores, so I set the YARN container memory to 30 GB and the container virtual CPU cores to 18.

However, only three of the datanodes do any work; the rest stay idle.

Below is my spark-submit command:

/opt/spark/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--driver-cores 18 \
--executor-memory 8g \
--executor-cores 18 \
--num-executors 7 \
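
To see which hosts actually received executors, one option is Spark's status tracker. This is a minimal sketch, assuming Spark 2.x and that `spark` is the running SparkSession from the question's code:

import org.apache.spark.SparkExecutorInfo;

// List the hosts that registered executors, so you can see
// how many of the 7 datanodes are actually in use.
SparkExecutorInfo[] executors = spark.sparkContext()
        .statusTracker()
        .getExecutorInfos();
for (SparkExecutorInfo e : executors) {
    System.out.println(e.host() + " running tasks: " + e.numRunningTasks());
}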

Java code:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.storage.StorageLevel;

SQLContext sqlc = new SQLContext(spark);

Dataset<Row> df = sqlc.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .load(traFile);

// repartition() returns a new Dataset, so the result must be reassigned
df = df.repartition(PartitionSize); // PartitionSize = 7
df.persist(StorageLevel.MEMORY_ONLY());
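
To confirm the repartition actually spread the rows evenly across the 7 partitions, a quick check with the standard spark_partition_id() function could look like this sketch:

import static org.apache.spark.sql.functions.spark_partition_id;

// Count the rows that landed in each partition after repartition().
// A roughly even spread across 7 partitions means the data is balanced.
df.groupBy(spark_partition_id().alias("partition"))
  .count()
  .show();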

This is my data information:

[screenshot: my data information]

I also tried the command below to rebalance the HDFS blocks:

sudo -u hdfs hdfs balancer

However, it did not help:

[screenshot: Nodes of the cluster]


Solution

  • I solved this problem by adding the following option to my script:

    --conf "spark.locality.wait.node=0"
    

    Below is my new script:

    /opt/spark/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --driver-cores $drivercores \
    --executor-memory 8g \
    --executor-cores $execores \
    --num-executors $exes \
    --conf "spark.locality.wait.node=0" \
    

    Thanks to this option, all of the nodes now do work.
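
    Why it works: spark.locality.wait (and its per-level variants such as spark.locality.wait.node) controls how long Spark waits for a task slot with good data locality before falling back to a less-local executor. When only a few nodes hold the HDFS block replicas, tasks pile up on those nodes; setting the wait to 0 lets tasks run immediately on any free executor. The same option can also be set in code when building the session; a minimal sketch (the app name is a placeholder):

    import org.apache.spark.sql.SparkSession;

    // Sketch: set the locality wait when creating the SparkSession
    // instead of on the spark-submit command line.
    SparkSession spark = SparkSession.builder()
            .appName("my-app") // placeholder name
            .config("spark.locality.wait.node", "0")
            .getOrCreate();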