We have Cloudera 5.2, and our users would like to start using Spark to its full potential (in distributed mode, so it can take advantage of data locality with HDFS). The service is already installed and shows up on the Cloudera Manager home page (Status), but when I click the service and then "Instances", it only shows a History Server role and, on other nodes, a Gateway role. From my understanding of Spark's architecture, you have a master node and worker nodes (which live together with the HDFS DataNodes), so in Cloudera Manager I tried "Add role instances", but only the "Gateway" role is available. How do you add Spark's worker (or executor) role to the hosts where you have HDFS DataNodes? Or is that unnecessary (I suspect it is, because with YARN, YARN takes charge of creating the executors and the application master)? And what about the master node? Do I need to configure anything so the users can use Spark in fully distributed mode?
The Master and Worker roles are part of the Spark (Standalone) service. You can either run Spark on YARN (in which case Master and Worker roles are irrelevant) or run Spark (Standalone).
Since you added the Spark service rather than Spark (Standalone) in Cloudera Manager, Spark is already running on YARN. In Cloudera Manager 5.2 and higher, there are two separate Spark services: Spark and Spark (Standalone). The Spark service runs Spark as a YARN application and has only Gateway roles in addition to the Spark History Server role.
How do you add Spark's worker (or executor) role to the hosts where you have HDFS DataNodes?
Not required. Only Gateway roles are needed on these hosts; YARN's NodeManagers, which normally run alongside the DataNodes, launch the executors in containers at runtime, as the sketch below illustrates.
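To illustrate (this is a minimal sketch, not your exact setup; the class name, HDFS paths and jar name are made up): a job like the following, submitted from a Gateway host with spark-submit --master yarn-cluster (or yarn-client), gets its application master and executors allocated by YARN in containers on the NodeManager hosts, so HDFS data locality is preserved without any Spark Worker roles.

    // Minimal Spark-on-YARN sketch (Spark 1.x era, as shipped with CDH 5.2).
    import org.apache.spark.{SparkConf, SparkContext}

    object YarnWordCount {
      def main(args: Array[String]): Unit = {
        // No master is hard-coded here; spark-submit supplies it
        // (--master yarn-cluster or yarn-client), and YARN then launches
        // the application master and executors in containers.
        val conf = new SparkConf().setAppName("YarnWordCount")
        val sc = new SparkContext(conf)

        // Reading from HDFS lets YARN schedule tasks near the blocks
        // (data locality) on the NodeManager/DataNode hosts.
        val counts = sc.textFile("hdfs:///user/example/input.txt")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///user/example/output")
        sc.stop()
      }
    }

A user would submit it from a Gateway host with something like: spark-submit --class YarnWordCount --master yarn-cluster yarn-wordcount.jar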
Quoting from the Cloudera Manager documentation:
In Cloudera Manager, Gateway roles take care of propagating client configurations to the other hosts in your cluster. So make sure you assign Gateway roles to hosts in the cluster; if you do not have Gateway roles, client configurations are not deployed.
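As a quick sanity check (a sketch, assuming the client configuration has been deployed to /etc/spark/conf, which is where Cloudera Manager places it by default), a user can start spark-shell on a host with the Gateway role and inspect the effective configuration:

    // Run inside spark-shell on a Gateway host; `sc` is the SparkContext
    // that spark-shell creates from the deployed client configuration.
    // The values shown are examples and depend on your cluster.
    sc.master                                   // e.g. "yarn-client"
    sc.getConf.getOption("spark.eventLog.dir")  // event log location for the History Server, if set
    sc.getConf.toDebugString                    // full effective configuration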