I have a 10-node Amazon EC2 cluster used for daily data processing. I want to use all 10 nodes for the daily batch process, which runs for only 2 hours. Once the reporting data points have been generated, I want to shut down 5 nodes and keep only 5 nodes active for the rest of the day to reduce cost.
I have a replication factor of 3.
In some cases all 3 copies of a block (the original and its replicas) end up on the 5 nodes that I shut down, so I cannot read that data.
Can I configure Cloudera Manager to pin specific databases or specific tables to given nodes, so that I can read all the data with only 5 nodes active?
Any other suggestions would be appreciated.
You can use rack awareness (virtually) to split your cluster into 2 "racks", and place the 5 nodes that you shut down regularly on a separate "rack". With rack awareness configured, the replica placement policy makes the NameNode (NN) spread each block's replicas across more than one rack, so with a replication factor of 3 at least one replica should land on the always-on "rack". Again, I'm referring to racks in the virtual sense here. That should get you what you want.
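To make the virtual-rack idea concrete, here is a minimal sketch of a Hadoop topology script, assuming plain Hadoop rack awareness where the script path is set through net.topology.script.file.name in core-site.xml. Hadoop calls the script with one or more host addresses as arguments and reads back one rack path per argument. The IP addresses and the rack names /rack-always-on and /rack-transient are made up for illustration. On a Cloudera Manager cluster you would normally assign racks to hosts from the Hosts page rather than ship your own script, so treat this only as an illustration of the mapping.

```python
#!/usr/bin/env python3
"""Sketch of an HDFS topology script that splits a cluster into two virtual racks."""
import sys

# Hypothetical addresses of the 5 nodes that are shut down after the batch run.
TRANSIENT_NODES = {
    "10.0.0.6",
    "10.0.0.7",
    "10.0.0.8",
    "10.0.0.9",
    "10.0.0.10",
}

# Hypothetical rack names; any distinct rack paths would work.
ALWAYS_ON_RACK = "/rack-always-on"
TRANSIENT_RACK = "/rack-transient"


def resolve_rack(host):
    """Return the virtual rack for a host; unknown hosts count as always-on."""
    return TRANSIENT_RACK if host in TRANSIENT_NODES else ALWAYS_ON_RACK


def main(argv):
    # Print one rack per argument, in the same order as the arguments.
    print(" ".join(resolve_rack(host) for host in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Running it as "topology.py 10.0.0.1 10.0.0.6" (the file name is also hypothetical) prints the rack for each address in order. The mapping only influences where new replicas are placed, so data written before the racks were defined may still need to be checked with hdfs fsck.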