cassandra schedule repair

cassandra: scheduling nodetool repair best practice


I have several questions about nodetool repair and how to schedule it.

Assumptions:

  • Use the partitioner range option (-pr)
  • Use parallel repair
  • gc_grace_seconds is the default (10 days)

Q1. What is the best practice for choosing the groups in which repair is executed: (a) per node, (b) per table, or (c) both?

Example:

  • (a) Nodes 0-2 => Group-1, Nodes 3-5 => Group-2, etc.
  • (b) Table user => Group-1, table videos => Group-2, etc.
  • (c) A mix of (a) and (b)

Q2. What is the best practice for scheduling repair tasks? I have come up with two calendar-based samples. Any advice, or a better schedule?

  • IN ... incremental repair of group N
  • FN ... full repair of group N
  • WN ... week N (1 to 4)
  • M to S ... Monday, Tuesday, ... Sunday

day  M   T   W   Th  F   St  S
W1   I1  I2  I3  I4  I1  I2  F1
W2   I3  I4  I1  I2  I3  I4  F2
W3   I1  I2  I3  I4  I1  I2  F3
W4   I3  I4  I1  I2  I3  I4  F4

day  M   T   W   Th  F   St  S
W1   I1  I2  I3  I1  I2  I3  F1
W2   I1  I2  I3  I1  I2  I3  F2
W3   I1  I2  I3  I1  I2  I3  F3
W4   I1  I2  I3  I1  I2  I3  spare
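
Both calendars above follow the same rule: incremental groups rotate round-robin over Monday to Saturday, and each Sunday runs a full repair of the group matching the week number (or is left spare once every group has had one). A minimal Python sketch of that rule is below. The function names `build_schedule` and `format_schedule` are made up for illustration, and the script only prints the calendar; it does not invoke nodetool.

```python
DAYS = ["M", "T", "W", "Th", "F", "St", "S"]


def build_schedule(num_groups, weeks=4):
    """Return one list of labels per week.

    Monday to Saturday rotate through the incremental groups (I1..In),
    continuing the rotation across week boundaries. Sunday is a full
    repair of group `week`, or "spare" when there are fewer groups
    than weeks.
    """
    rows = []
    slot = 0  # running count of incremental slots across all weeks
    for week in range(1, weeks + 1):
        row = []
        for _ in range(6):  # Monday to Saturday
            row.append(f"I{slot % num_groups + 1}")
            slot += 1
        row.append(f"F{week}" if week <= num_groups else "spare")
        rows.append(row)
    return rows


def format_schedule(rows):
    """Render the rows in the same layout as the calendars in the question."""
    lines = ["day " + " ".join(DAYS)]
    for week, row in enumerate(rows, start=1):
        lines.append(f"W{week} " + " ".join(row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_schedule(build_schedule(4)))  # first sample: 4 groups
    print()
    print(format_schedule(build_schedule(3)))  # second sample: 3 groups
```

Calling `build_schedule(4)` reproduces the first sample and `build_schedule(3)` the second, so changing the number of groups is a one-argument change.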



Solution

  • Q1. Repair in this order of priority:

    • Nodes that were down for more than 3 hours (the default hint window), because hinted handoff will not bring them up to date after that.
    • Nodes that show dropped mutations in nodetool tpstats.
    • Tables on which your business logic runs deletes, so that every node receives the tombstones.

  • Q2. It depends on your cluster size and your load. If the whole cluster can be fully repaired within 10 days (the gc_grace_seconds window), stick with full repair. Incremental repair has the drawback of splitting SSTables, which adds extra compaction load later.
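
To make the "can it be repaired within 10 days" check concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function names, the per-node repair time (which you would have to measure on your own cluster), the idea of repairing a fixed number of nodes at a time, and the 2-day safety margin. It is not a Cassandra API.

```python
def full_repair_cycle_days(num_nodes, hours_per_node, concurrent_nodes=1):
    """Estimate the days needed to run one -pr repair on every node.

    Assumes nodes are repaired in batches of `concurrent_nodes` and that
    each batch takes `hours_per_node` hours (a measured, cluster-specific
    figure; the values used below are hypothetical).
    """
    batches = -(-num_nodes // concurrent_nodes)  # ceiling division
    return batches * hours_per_node / 24.0


def fits_in_gc_grace(cycle_days, gc_grace_days=10, safety_margin_days=2):
    """True if a full cycle finishes before gc_grace_seconds elapses,
    leaving an arbitrary safety margin for failed or retried runs."""
    return cycle_days <= gc_grace_days - safety_margin_days


if __name__ == "__main__":
    small = full_repair_cycle_days(num_nodes=12, hours_per_node=8)
    large = full_repair_cycle_days(num_nodes=40, hours_per_node=8)
    print(f"12 nodes: {small:.1f} days, fits: {fits_in_gc_grace(small)}")
    print(f"40 nodes: {large:.1f} days, fits: {fits_in_gc_grace(large)}")
```

With these hypothetical figures the 12-node cluster completes a cycle in 4 days and can stay on full repair, while the 40-node cluster needs more than 13 days and would have to repair more nodes concurrently, shorten each run, or reconsider the schedule.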