hadoop, cassandra, datastax-enterprise, datastax

DataStax Hadoop nodes basics


I'm trying to set up some Hadoop nodes along with some Cassandra nodes in my DataStax Enterprise cluster. Two things are not clear to me at this point. First, how many Hadoop nodes do I need? Is it the same as the number of Cassandra nodes? Does the data still live on the Cassandra nodes? Second, the tutorials mention that I should have vnodes disabled on the Hadoop nodes. Can I still use vnodes on the Cassandra nodes in that cluster? Thank you.


Solution

  • In DataStax Enterprise you run Hadoop on nodes that are also running Cassandra. The most common deployment uses two datacenters (logical groupings of nodes). One datacenter is devoted to analytics and contains the machines that run Hadoop and C* at the same time; the other datacenter is C* only and serves the OLTP function of your cluster. The C* processes on the analytics nodes are connected to the rest of your cluster (like any other C* node) and receive updates when mutations are written, so they are eventually consistent with the rest of your database. The data lives both on these nodes and on the other nodes in your cluster. Most people end up with a replication pattern using NetworkTopologyStrategy that specifies several replicas in the C*-only DC and a single replica in the analytics DC, but your use case may differ. The number of nodes does not have to be equal in the two datacenters.

    For your second question: yes, you can have vnodes enabled in the C*-only datacenter. In addition, if your batch jobs are sufficiently large, you could also run vnodes in your analytics datacenter with only a slight performance hit. Again, this depends entirely on your use case. If you want many short, fast analytics jobs, you do NOT want vnodes enabled in your analytics datacenter.
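
To make the replication pattern above concrete, here is a minimal sketch that builds the CQL statement for a keyspace with several replicas in the C*-only datacenter and one in the analytics datacenter. The keyspace name `my_app`, the datacenter names `Cassandra` and `Analytics`, and the replica counts are assumptions for illustration; use the datacenter names your snitch actually reports (for example, from `nodetool status`). The helper `create_keyspace_cql` is hypothetical and only formats a string; it does not connect to a cluster.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy.

    replicas_per_dc maps a datacenter name to its replica count.
    The names passed in below are illustrative assumptions.
    """
    if not replicas_per_dc:
        raise ValueError("at least one datacenter is required")

    # NetworkTopologyStrategy takes one replication factor per datacenter.
    options = ["'class': 'NetworkTopologyStrategy'"]
    for dc, rf in replicas_per_dc.items():
        options.append(f"'{dc}': {int(rf)}")

    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{{', '.join(options)}}};"
    )


# Assumed layout: 3 replicas in the C*-only DC, 1 in the analytics DC.
statement = create_keyspace_cql("my_app", {"Cassandra": 3, "Analytics": 1})
print(statement)
```

The resulting statement could be run in cqlsh. Because replica counts are set per datacenter, the two datacenters can have different numbers of nodes, as long as each has at least as many nodes as its replica count.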