apache-spark, hdfs, cluster-computing, mapr

Apache Spark: Cluster with nodes of different configuration


I have a production cluster with 14 nodes. Twelve of them have the same configuration, and the other 2 have a much higher configuration (almost 3 times the resources). 1) Will this impact the overall resource utilization of Spark? 2) How can I make use of the extra memory that is available only on those 2 nodes? 3) Also, if my RDD is larger than the available resources during processing, Spark will process part of the task in memory and then load the remaining data from HDFS again. How can I overcome this scenario to get the best performance?


Solution

  • There are really three issues raised by your question:

    1) how Spark will behave when distributing computation

    2) how will I/O loads and data be distributed across the cluster

    3) whether you are using MapR (implied by the tags) or HDFS (implied by the tags and the text of your question).

    For 1, depending on how you run Spark, you can usually define some nodes as having more resources than others. If you are using, for instance, the Spark operator that we developed at MapR, you can have quite refined estimates and control.
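
    As a rough illustration of what "more resources on some nodes" means in practice, here is a minimal sketch that assumes a resource manager such as YARN, where each node advertises its own memory and cores and Spark requests executors of one fixed size. Under that assumption the larger nodes simply host more executors. The node sizes (64 GB / 16 cores and 192 GB / 48 cores) and the executor size are hypothetical values chosen to match "almost 3 times"; the overhead rule mirrors Spark's documented default of max(384 MB, 10% of executor memory).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    memory_mb: int  # memory the node offers to containers (hypothetical value)
    cores: int      # cores the node offers to containers (hypothetical value)


def executors_per_node(node, executor_memory_mb, executor_cores):
    """Return how many fixed-size executors fit on a single node."""
    # Each container needs the executor heap plus off-heap overhead,
    # assumed here to follow Spark's default: max(384 MB, 10% of the heap).
    overhead_mb = max(384, int(0.10 * executor_memory_mb))
    container_mb = executor_memory_mb + overhead_mb
    by_memory = node.memory_mb // container_mb
    by_cores = node.cores // executor_cores
    # The scarcer of the two resources limits the executor count.
    return min(by_memory, by_cores)


small = Node("small", memory_mb=64 * 1024, cores=16)
large = Node("large", memory_mb=192 * 1024, cores=48)
cluster = [small] * 12 + [large] * 2

# One executor size for the whole application: 8 GB heap, 4 cores.
per_node = {n.name: executors_per_node(n, 8192, 4) for n in (small, large)}
total = sum(executors_per_node(n, 8192, 4) for n in cluster)

print(per_node)
print("total executors:", total)
```

    With these made-up numbers the core count, not memory, is the limiting resource on both node types, so extra memory on the big nodes would only be used if the executor size were chosen with a higher memory-to-core ratio.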

    For 2, I/O loads and the amount of data are generally very well balanced in MapR if you enable the balancer functions. HDFS does not normally do nearly as good a job. This will also depend a bit on your workloads and the history of your cluster. For instance, if you have 12 identical nodes that are nearly full and you add two big nodes that are, of course, initially empty, then new data will go to the new nodes until the balancer has had time to move old data onto the big new nodes. If your new data is what you are primarily analyzing, this can lead to an imbalance in I/O activity.
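
    To put a rough number on that imbalance, the following sketch assumes a deliberately simplified placement model in which new data lands on each node in proportion to its free space. This is an assumption for illustration only, not a description of how HDFS or MapR actually choose block locations, and the capacities and fill levels are hypothetical.

```python
def new_data_share(free_tb_by_node):
    """Return each node's share of newly written data.

    Simplified model (an assumption): placement is proportional to free space.
    """
    total_free = sum(free_tb_by_node.values())
    return {name: free / total_free for name, free in free_tb_by_node.items()}


# Hypothetical cluster: 12 old nodes of 10 TB that are 90% full (1 TB free each)
# and 2 new nodes of 30 TB that start out empty (30 TB free each).
free_tb = {f"old-{i}": 1.0 for i in range(12)}
free_tb.update({f"new-{i}": 30.0 for i in range(2)})

shares = new_data_share(free_tb)
share_on_new_nodes = sum(s for name, s in shares.items() if name.startswith("new-"))

print(f"share of new data on the 2 new nodes: {share_on_new_nodes:.1%}")
```

    Under this model roughly five sixths of the freshly written data ends up on 2 of the 14 nodes, so a job that reads mostly recent data would direct most of its I/O at those two machines.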

    In MapR, you can easily avoid this by restricting the locality of new data, but not the old data. This means new data will only fill old nodes while the balancer will move old data to the new nodes. Once you have reasonable balance, you can allow new data to live anywhere.

    For 3, only you can answer. There are obvious and substantial advantages to using MapR for small clusters because you don't have to devote any nodes to being name nodes. There are obvious and substantial advantages to using MapR at large scale as well, of course, but they are different.