
How to scale down a CrateDB cluster?


For testing, I wanted to shrink my 3-node cluster to 2 nodes, and later do the same with my 5-node cluster.

However, after following the best-practice procedure for shrinking a cluster:

  1. Back up all tables
  2. For all tables: alter table xyz set (number_of_replicas = 2) if it was less than 2 before (steps 2 and 3 are sketched as SQL after this list)
  3. SET GLOBAL PERSISTENT discovery.zen.minimum_master_nodes = <half of the cluster + 1>;
    a. If the data check should stay green throughout, set min_availability to 'full': https://crate.io/docs/reference/configuration.html#graceful-stop
  4. Initiate graceful stop on one node
  5. Wait for the data check to turn green
  6. Repeat from 3.
  7. When done, persist the node configuration in crate.yml: gateway.recover_after_nodes: n, discovery.zen.minimum_master_nodes: (n / 2) + 1, gateway.expected_nodes: n
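
As a sketch, steps 2 and 3 translate to statements like these for the 3-node case (mytable stands in for each table; the graceful-stop setting name is taken from the linked docs page):

  ALTER TABLE mytable SET (number_of_replicas = 2);                        -- step 2, per table
  SET GLOBAL PERSISTENT discovery.zen.minimum_master_nodes = 2;            -- step 3: (3 / 2) + 1
  SET GLOBAL TRANSIENT cluster.graceful_stop.min_availability = 'full';    -- step 3a, optional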

My cluster never went back to "green" again, and I also have a critical node check failing.

What went wrong here?

crate.yml:

  ... 
  ################################## Discovery ##################################

  # Discovery infrastructure ensures nodes can be found within a cluster
  # and master node is elected. Multicast discovery is the default.

  # Set to ensure a node sees M other master eligible nodes to be considered
  # operational within the cluster. Its recommended to set it to a higher value
  # than 1 when running more than 2 nodes in the cluster.
  #
  # We highly recommend to set the minimum master nodes as follows:
  #   minimum_master_nodes: (N / 2) + 1 where N is the cluster size
  # That will ensure a full recovery of the cluster state.
  #
  discovery.zen.minimum_master_nodes: 2

  # Set the time to wait for ping responses from other nodes when discovering.
  # Set this option to a higher value on a slow or congested network
  # to minimize discovery failures:
  #
  # discovery.zen.ping.timeout: 3s
  #

  # Time a node is waiting for responses from other nodes to a published
  # cluster state.
  #
  # discovery.zen.publish_timeout: 30s

  # Unicast discovery allows to explicitly control which nodes will be used
  # to discover the cluster. It can be used when multicast is not present,
  # or to restrict the cluster communication-wise.
  # For example, Amazon Web Services doesn't support multicast discovery.
  # Therefore, you need to specify the instances you want to connect to a
  # cluster as described in the following steps:
  #
  # 1. Disable multicast discovery (enabled by default):
  #
  discovery.zen.ping.multicast.enabled: false
  #
  # 2. Configure an initial list of master nodes in the cluster
  #    to perform discovery when new nodes (master or data) are started:
  #
  # If you want to debug the discovery process, you can set a logger in
  # 'config/logging.yml' to help you doing so.
  #
  ################################### Gateway ###################################

  # The gateway persists cluster meta data on disk every time the meta data
  # changes. This data is stored persistently across full cluster restarts
  # and recovered after nodes are started again.

  # Defines the number of nodes that need to be started before any cluster
  # state recovery will start.
  #
  gateway.recover_after_nodes: 3

  # Defines the time to wait before starting the recovery once the number
  # of nodes defined in gateway.recover_after_nodes are started.
  #
  #gateway.recover_after_time: 5m

  # Defines how many nodes should be waited for until the cluster state is
  # recovered immediately. The value should be equal to the number of nodes
  # in the cluster.
  #
  gateway.expected_nodes: 3

Solution

  • So there are two things that are important:

    • The number of replicas is essentially the number of nodes you can lose in a typical setup (2 is recommended so that you can scale down AND lose a node in the process and still be OK)
    • The procedure is recommended for clusters > 2 nodes ;)

    CrateDB will automatically distribute the shards across the cluster in a way that no replica and primary share a node. If that is not possible (which is the case if you have 2 nodes and 1 primary with 2 replicas), the data check will never return to 'green'. So in your case, set the number of replicas to 1 in order to get the cluster back to green (alter table mytable set (number_of_replicas = 1)).
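
    A minimal sketch of that fix plus a query to confirm the data check can go green again (assuming the sys.shards system table is available; mytable is a placeholder):

      ALTER TABLE mytable SET (number_of_replicas = 1);

      -- every shard should eventually reach state 'STARTED' on the 2 remaining nodes
      SELECT table_name, state, count(*) AS shards
      FROM sys.shards
      GROUP BY table_name, state;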

    The critical node check is due to the cluster not having received an updated crate.yml yet: your file still has the configuration of a 3-node cluster in it, hence the message. Since CrateDB only loads expected_nodes at startup (it's not a runtime setting), a restart of the whole cluster is required to conclude scaling down. It can be done with a rolling restart, but be sure to run SET GLOBAL PERSISTENT discovery.zen.minimum_master_nodes = <half of the cluster + 1>; with the correct value first, otherwise consensus will not work...
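
    As a sketch, applying the formulas from step 7 with n = 2, the relevant crate.yml entries for the scaled-down cluster would look like this (rolled out on both remaining nodes before the rolling restart):

      # crate.yml after scaling down to 2 nodes
      discovery.zen.minimum_master_nodes: 2   # (2 / 2) + 1
      gateway.recover_after_nodes: 2
      gateway.expected_nodes: 2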

    Also, it's recommended to scale down one node at a time in order to avoid overloading the cluster with rebalancing and accidentally losing data.
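
    For example, before stopping the next node, a query like the following (again assuming sys.shards) can show whether rebalancing has settled, i.e. no shards are still RELOCATING or INITIALIZING:

      -- shard count per node and state; proceed only once everything is STARTED
      SELECT node['name'] AS node_name, state, count(*) AS shards
      FROM sys.shards
      GROUP BY node['name'], state;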