apache-spark spark-streaming datastax datastax-enterprise mesos

How to best manage all my nodes' CPU, memory and storage with DataStax Spark?


I currently have a cluster of 4 Spark nodes and 1 Solr node, with Cassandra as my database. I want to grow it to 20 nodes in the medium term and to 100 in the long term, but DataStax doesn't seem to support Mesos or YARN. How would I best manage the CPU, memory and storage of all these nodes? Is Mesos even necessary with 20 or 100 nodes? So far I couldn't find any example of this using DataStax. I usually don't have jobs that run to completion; instead I am running a continuous stream of data. That's why I am even thinking of dropping DataStax: in my opinion I couldn't manage this many nodes efficiently without YARN or Mesos. But maybe there is a better solution I haven't thought of? Also, I am using Python, so apparently YARN is my only option.

If you have any suggestions or best practice examples let me know.

Thanks!


Solution

  • If you want to run DSE with a supported Hadoop/YARN environment, you need to use BYOH (Bring Your Own Hadoop); read about it in the DataStax documentation. In BYOH you can either run the internal Hadoop platform in DSE or run a Cloudera or HDP platform with YARN and anything else that is available.
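
Since the question mentions Python and per-node CPU and memory, here is a rough sketch of how resource limits are usually passed to a PySpark application on YARN. It assumes a YARN cluster is reachable (for example via a BYOH setup) and that Spark jobs can be submitted to it, which the answer above does not confirm. The helper name `yarn_submit_command`, the script name `stream_job.py` and the numbers are made up for illustration; the flags themselves are standard `spark-submit` options. The code only builds the command line and does not run it.

```python
from typing import List


def yarn_submit_command(app_path: str, num_executors: int,
                        executor_cores: int, executor_memory_gb: int) -> List[str]:
    """Build a spark-submit argument list that caps an app's resources on YARN.

    Hypothetical helper: it only assembles the arguments, it does not submit.
    """
    if min(num_executors, executor_cores, executor_memory_gb) < 1:
        raise ValueError("resource values must be positive integers")
    return [
        "spark-submit",
        "--master", "yarn",                 # let YARN schedule the executors
        "--deploy-mode", "client",          # driver stays on the submitting host
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", f"{executor_memory_gb}g",
        app_path,
    ]


# Illustrative values only: 4 executors with 2 cores and 4 GB each.
cmd = yarn_submit_command("stream_job.py", num_executors=4,
                          executor_cores=2, executor_memory_gb=4)
print(" ".join(cmd))
```

For a continuous streaming job these limits stay reserved for as long as the application runs, so they would need to be sized against the total capacity of the cluster rather than per batch job.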