
Hadoop cluster: any way to restrict a Spark application to run only on specific datanodes?


We have a Hadoop cluster (HDP 2.6.5 managed by Ambari, with 25 datanode machines).

We are running a Spark Streaming application (Spark 2.1 on Hortonworks HDP 2.6.x).

Currently, the Spark Streaming application runs on all of the datanode machines.

Now we want the Spark Streaming application to run only on the first 10 datanode machines, so that the remaining 15 datanodes are restricted and the Spark application runs exclusively on the first 10.

Can this scenario be achieved with Ambari features, or by some other approach?

For example, we found

https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/configuring_node_labels.html

and

http://crazyadmins.com/configure-node-labels-on-yarn/

but we are not sure whether Node Labels can help us.


Solution

  • @Jessica Yes, you are absolutely on the right path. YARN Node Labels and YARN queues are how Ambari administrators control team-level access to portions of the YARN cluster. You can start very basic with just a non-default queue, or get very in-depth with many queues for many different teams. Node labels take it to another level, allowing you to map queues and teams to specific nodes (see the sketch below).
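    As a minimal sketch of the node-label side, assuming a label named "spark" and placeholder hostnames (dn01.example.com, dn02.example.com, ...) for the first 10 datanodes; adapt both to your cluster:

    ```bash
    # Prerequisites (set via Ambari: YARN -> Configs -> Advanced yarn-site):
    #   yarn.node-labels.enabled=true
    #   yarn.node-labels.fs-store.root-dir=hdfs:///yarn/node-labels

    # 1. Create a cluster-level node label. exclusive=true means only queues
    #    granted access to this label can place containers on labeled nodes.
    sudo -u yarn yarn rmadmin -addToClusterNodeLabels "spark(exclusive=true)"

    # 2. Attach the label to the datanodes that should run Spark
    #    (extend the host list to cover all 10 machines).
    sudo -u yarn yarn rmadmin -replaceLabelsOnNode \
      "dn01.example.com=spark dn02.example.com=spark dn03.example.com=spark"

    # 3. Verify the label was created.
    yarn cluster --list-node-labels
    ```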

    Here is a post with the syntax for submitting a Spark job to a specific YARN queue:

    How to choose the queue for Spark job using spark-submit?
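
    In practice the submission could look like the sketch below; the queue name, label name, application class, and jar are placeholders. Spark on YARN can also pin the ApplicationMaster and executors to a label directly via spark.yarn.am.nodeLabelExpression and spark.yarn.executor.nodeLabelExpression:

    ```bash
    # Queue name, label name, class, and jar below are placeholders.
    # The two nodeLabelExpression settings keep both the ApplicationMaster
    # and the executors on the labeled datanodes.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --queue spark \
      --conf spark.yarn.am.nodeLabelExpression=spark \
      --conf spark.yarn.executor.nodeLabelExpression=spark \
      --class com.example.StreamingApp \
      streaming-app.jar
    ```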

    I tried to find the 2.6 version of these docs, but was not able to; the docs have been really mixed up since the merger:

    https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/ch_node_labels.html

    https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/data-operating-system/content/configuring_node_labels.html

    The actual steps you have to take may be a combination of items from both; that has been my typical experience when working with Ambari on HDP/HDF. A sketch of the queue-to-label mapping follows.
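
    The Capacity Scheduler side could look roughly like this (a sketch, assuming a queue named "spark", the label "spark", and a 50/50 split of the default partition with the default queue; in Ambari this lives in the capacity-scheduler configuration):

    ```xml
    <!-- Define a dedicated queue next to the default one. -->
    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>default,spark</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.default.capacity</name>
      <value>50</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.spark.capacity</name>
      <value>50</value>
    </property>

    <!-- Grant the queue access to the "spark" label partition. -->
    <property>
      <name>yarn.scheduler.capacity.root.spark.accessible-node-labels</name>
      <value>spark</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.accessible-node-labels.spark.capacity</name>
      <value>100</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.spark.accessible-node-labels.spark.capacity</name>
      <value>100</value>
    </property>

    <!-- Jobs in this queue land on the labeled nodes by default. -->
    <property>
      <name>yarn.scheduler.capacity.root.spark.default-node-label-expression</name>
      <value>spark</value>
    </property>
    ```

    After saving the configuration, refresh the queues with `yarn rmadmin -refreshQueues` (or restart YARN from Ambari) for the changes to take effect.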