Search code examples
mysqlreplicationfailover

MySQL replication: failover scenario


I have the following MySQL instances, along with a replication setup:

S1 -----> (M1 <--> M2), where:

M1 - M2 is a multi-master replication setup,

S1 - a slave which replicates the writes which are done at Master M1.

Now, I'm trying to enhance the setup with a channel failover mechanism, where S1 would start replicating from M2, should M1 go down. Currently, the only way of doing this that I see is:

(M1 failure detection mechanism on S1 machine), then:

-> S1 gets the latest timestamp of M1's queries from the local relay log file.

-> M2 searches (bash script using mysqlbinlog utility) for the local binlog file + binlog index which corresponds to S1's latest timestamp

-> S1 can finally do a "STOP SLAVE", "CHANGE MASTER TO master_host=M2... master_log_file=... master_log_pos=...", etc. command to continue replication, but from M2 this time

Is there a better (and less error prone) way of doing this?

Thank you

EDIT: Nowadays, this is much easier to achieve thanks to the unique Xid binlog query tags commonly used by the publicly accessible MySQL clustering solutions.


Solution

  • There is a more simplistic way to retrieve the binlog and position needed.

    Would it make more sense to just use the current binlog and position as M2 knows it ? You need to check the Slave status on M2.

    Example

    mysql> show slave status\G
    *************************** 1. row ***************************
                   Slave_IO_State: Waiting for master to send event
                      Master_Host: 10.64.51.130
                      Master_User: replicant
                      Master_Port: 3306
                    Connect_Retry: 60
                  Master_Log_File: mysql-bin.000463
              Read_Master_Log_Pos: 453865699
                   Relay_Log_File: relay-bin.001226
                    Relay_Log_Pos: 453865845
            Relay_Master_Log_File: mysql-bin.000463
                 Slave_IO_Running: Yes
                Slave_SQL_Running: Yes
                  Replicate_Do_DB:
              Replicate_Ignore_DB: search_cache
               Replicate_Do_Table:
           Replicate_Ignore_Table:
          Replicate_Wild_Do_Table:
      Replicate_Wild_Ignore_Table:
                       Last_Errno: 0
                       Last_Error:
                     Skip_Counter: 0
              Exec_Master_Log_Pos: 453865699
                  Relay_Log_Space: 453866038
                  Until_Condition: None
                   Until_Log_File:
                    Until_Log_Pos: 0
               Master_SSL_Allowed: No
               Master_SSL_CA_File:
               Master_SSL_CA_Path:
                  Master_SSL_Cert:
                Master_SSL_Cipher:
                   Master_SSL_Key:
            Seconds_Behind_Master: 0
    Master_SSL_Verify_Server_Cert: No
                    Last_IO_Errno: 0
                    Last_IO_Error:
                   Last_SQL_Errno: 0
                   Last_SQL_Error:
      Replicate_Ignore_Server_Ids:
                 Master_Server_Id: 106451130
    1 row in set (0.00 sec)
    
    mysql>
    

    For this display, there are five crucial components:

    • Master_Log_File (Line 6) : Log File on the Master whose position was last read
    • Read_Master_Log_Pos (Line 7) : Last position read on the Slave from the Master
    • Relay_Master_Log_File (Line 10) : Log File on the Master whose position was last executed
    • Exec_Master_Log_Pos (Line 22) : Last position executed on the Slave from the Master
    • Relay_Log_Space (Line 23) : Sum of bytes from all relay logs

    Relay_Log_Space

    Please note Relay_Log_Space. Once this number stops incrementing, every SQL statement imported from the Master has been read. Unfortunately, it is possible that the last relay log may be corrupt or simply incomplete because of a sudden failover.

    Replication Coordinates

    Please also note that the Replication Coordinates (Relay_Master_Log_File, Exec_Master_Log_Pos). This is the position you are hunting for. However, like Relay_Log_Space it may still be incrementing. In fact, those Replication Coordinates should be equal to the other Replication Coordinates (Master_Log_File,Read_Master_Log_Pos ). That's when you know everything is caught up. If the pair of Replication Coordinates never meet, then you should rely on Relay_Log_Space a little more in terms of when it stops incrementing.

    What about Seconds_Behind_Master ?

    The reason you cannot use Seconds_Behind_Master is simple. Once a Master goes down hard, all it takes just one Replication thread (Slave_IO_Running or Slave_SQL_Running) to become No and Seconds_Behind_Master turns NULL.