Search code examples
databasesinglestore

MemSQL database not available for reads while leaf node is in replicating mode


I have a CentOS server running a local memsql cluster (aggregator and leaf on the same machine). I have a databse named offers. For some reason, I cannot execute any queries against tables in my database.

Everything was working fine until I tried to add another machine to the cluster. I had the IT team at my place replicate the server I was working on (completely). I went over to the replicated server, deleted the database in question and then registered the server using the memsql-toolbox-config register-node command. Then the database showed it was under the transition state. I restarted memsql using memsql-ops and got to this situation.

Running a simple query yields:

memsql> select * from table;
ERROR 2261 (HY000): Query `select * from table` couldn't be executed because of an in progress failover operation.  Check the status of the leaf nodes in the cluster (error 1049:'Leaf Error (172.26.32.20:3307): Unknown database 'offers_5'')

The output for the the cluster status command is:

memsql> show cluster status;
+---------+--------------+------+----------+-------------+-------------+----------+--------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
| Node ID | Host         | Port | Database | Role        | State       | Position | Master Host  | Master Port | Metadata Master Node ID | Metadata Master Host | Metadata Master Port | Metadata Role | Details                                         |
+---------+--------------+------+----------+-------------+-------------+----------+--------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
|       1 | 172.26.32.20 | 3306 | cluster  | master      | online      | 0:181    | NULL         |        NULL |                    NULL | NULL                 |                 NULL | Reference     |                                                 |
|       1 | 172.26.32.20 | 3306 | offers   | master      | online      | 0:156505 | NULL         |        NULL |                    NULL | NULL                 |                 NULL | Reference     |                                                 |
|       2 | 172.26.32.20 | 3307 | cluster  | async slave | replicating | 0:180    | 172.26.32.20 |        3306 |                       1 | 172.26.32.20         |                 3306 | Reference     | stage: packet wait, state: x_streaming, err: no |
|       2 | 172.26.32.20 | 3307 | offers   | sync slave  | replicating | 0:156505 | 172.26.32.20 |        3306 |                       1 | 172.26.32.20         |                 3306 | Reference     |                                                 |
+---------+--------------+------+----------+-------------+-------------+----------+--------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
4 rows in set (0.00 sec)

So it seems that the the second node is replicating. Also note the details column saying:

stage: packet wait, state: x_streaming, err: no

Running the replication status command gives:

memsql> show replication status;
+--------+----------+------------+--------------+------------------+--------------------+------------------+----------------+----------------+-----------+---------------------------+-------------+-----------------+-------------------+-----------------+---------------+---------------+
| Role   | Database | Master_URI | Master_State | Master_CommitLSN | Master_HardenedLSN | Master_ReplayLSN | Master_TailLSN | Master_Commits | Connected | Slave_URI                 | Slave_State | Slave_CommitLSN | Slave_HardenedLSN | Slave_ReplayLSN | Slave_TailLSN | Slave_Commits |
+--------+----------+------------+--------------+------------------+--------------------+------------------+----------------+----------------+-----------+---------------------------+-------------+-----------------+-------------------+-----------------+---------------+---------------+
| master | cluster  | NULL       | online       | 0:181            | 0:181              | 0:177            | 0:181          |             86 | yes       | 172.26.32.20:3307/cluster | replicating | 0:180           | 0:181             | 0:180           | 0:181         |            84 |
| master | offers   | NULL       | online       | 0:156505         | 0:156505           | 0:156505         | 0:156505       |            183 | yes       | 172.26.32.20:3307/offers  | replicating | 0:156505        | 0:156505          | 0:156505        | 0:156505      |           183 |
+--------+----------+------------+--------------+------------------+--------------------+------------------+----------------+----------------+-----------+---------------------------+-------------+-----------------+-------------------+-----------------+---------------+---------------+
2 rows in set (0.00 sec)

I never initiated any fail over or replication. Anyone knows why this is happening? How could I solve this?

EDIT:

Using memsql-ops I get:

[me@memsql ~]$ memsql-ops memsql-list
 ID       Agent Id  Process State  Cluster State  Role    Host          Port  Version
 33829AF  Af13af7   RUNNING        CONNECTED      MASTER  172.26.32.20  3306  6.5.18
 BBA1B61  Af13af7   RUNNING        CONNECTED      LEAF    172.26.32.20  3307  6.5.18

But with memsql-admin, with the new memsql tools:

[me@memsql ~]$ memsql-admin list-nodes
✘ Failed to list nodes on all hosts: failed to list nodes on 1 host: 
172.26.32.20
No nodes found

Making my question a bit clearer - How can I get my server to respond to queries again? And after I do, How should I act to add another host? Should I clean the replicated server completely of any memsql data?

2nd EDIT:

I managed to solve this problem by delete my database and cluster data, and setting up a new one using the new MemSQL tools, throwing away MemsqlOps. Read my answer.


Solution

  • I managed to set up a working cluster.

    As micahbhakti mentioned in his answer, I tried using only the newer MemSQL tools, instead of the deprecated MemSQL Ops. It required deleting the MemSQL agent existing on both servers and then following the tutorial in the MemSQL documentation. Here are the steps I took for anyone struggling with this issue which is better described as: My MemSQL-Ops-managed-MemSQL-cluster is not responding. How can I upgrade it to a working MemSQL-tools-managed-cluster?

    1. Save what data you can

    The following step is to delete all memsql data, so it would be best if you could save your data. The table data could be stored in CSV files easily with a simple

    SELECT * FROM important_data_containing_table INTO OUTFILE '/home/yourfolder/yourcsvfile.csv';
    

    Do this for all tables containing important data. You could also save the scheme itself. You can do that by viewing and copying to another file all the create queries you used to create the table originally, to re-execute them later. Use this

    SHOW CREATE TABLE your_table_name
    

    The documentation for mysql is described here. It might not be similar to the syntax used in mem, but the above base command works. For exact information, read about MySQL Features Unsupported in MemSQL.

    2. Delete anything to do with Memsql-Ops

    As it is said here about the uninstall command:

    Stops the local MemSQL Ops agent and deletes all its data.

    If MemSQL nodes are already installed in the local host, this command will prompt users to delete those nodes first before proceeding with the uninstall.

    And indeed, if there is a node runnning (in my case there were), you will be prompted to run another command to delete those nodes: memsql-ops memsql-delete --all. This WILL delete all data in your database as said in it's documentation:

    Deletes all data for a MemSQL node. This operation is not reversible and may lead to data loss. Users who want to perform this operation are prompted to explicitly type ‘DELETE’ to be sure of their decision.

    That's why I asked you to save what ever you need :)

    This should be done for each host you want to include in your new shiny cluster.

    3. Follow the instructions to create the new cluster using MemSQL tools

    After you cleaned your servers from the deprecated MemSQL ops agent and data, you can follow the instructions here. I chose to set up a multiple host comprehensive set up. The process will ask you to register your hosts, and then set up the nodes roles (master aggregator, aggreators and leafs), ip addresses, passwords, ports and etc.

    After that, you can try to test the cluster, making changes in one machine and view them in another. Also the output for memsql-admin list-nodes on the deploying machine for my cluster was:

    +------------+------------+--------------+------+---------------+--------------+---------+----------------+--------------------+
    | MemSQL ID  |    Role    |     Host     | Port | Process State | Connectable? | Version | Recovery State | Availability Group |
    +------------+------------+--------------+------+---------------+--------------+---------+----------------+--------------------+
    | AAAAAAAAAA | Master     | 172.26.32.20 | 3306 | Running       | True         | 6.7.16  | Online         |                    |
    | BBBBBBBBBB | Aggregator | 172.26.32.22 | 3306 | Running       | True         | 6.7.16  | Online         |                    |
    | CCCCCCCCCC | Leaf       | 172.26.32.20 | 3307 | Running       | True         | 6.7.16  | Online         | 1                  |
    | DDDDDDDDDD | Leaf       | 172.26.32.22 | 3307 | Running       | True         | 6.7.16  | Online         | 1                  |
    +------------+------------+--------------+------+---------------+--------------+---------+----------------+--------------------+
    

    4. Restore the data

    Re-execute all the create table queries you saved in step 1, and import all data exported to a csv using this syntax:

    LOAD DATA INFILE '/home/yourfolder/yourcsvfile.csv' INTO TABLE your_table;
    

    And that's it! Now you can manage your cluster using the new MemSQL studio that run on the default http://your_deployment_machine:8080.

    Enjoy :)