aerospike, aerospike-ce

Aerospike: one of three nodes went down abruptly and writes stopped


We are running a 3-node cluster (version 4.2.0.4 CE, data in memory) on AWS. We recently noticed that writes were not happening and found that one node was down. We expected writes to continue with a single node down. Once we restarted that node, writes resumed. We access the Aerospike cluster from outside AWS.

The following INFO log line was being printed continuously on two of the nodes:

INFO (hb): (hb.c:4319) found redundant connections to same node, fds 101 31 - choosing at random

On the third node, no logs were being printed, and the asadm stats showed no reads or writes. We have also observed that records are unevenly distributed across the nodes.

The network stanza of the configuration file, shown below together with the namespace definition, is the same on all three servers.

network {
    service {
            address any
            port 3000
    }

    heartbeat {

            mode mesh
            port 3002 # Heartbeat port for this node.

            # List one or more other nodes, one ip-address & port per line:
            mesh-seed-address-port 13.xxx.xxx.xxx 3002
            mesh-seed-address-port 13.xxx.xxx.xxx 3002
            mesh-seed-address-port 13.xxx.xxx.xxx 3002

            interval 150
            timeout 10
    }

    fabric {
            port 3001
    }

    info {
            port 3003
    }
}
namespace smpa {
    replication-factor 2
    memory-size 12G
    storage-engine memory
    single-bin true
    high-water-memory-pct 80
    stop-writes-pct 90
}

$ asadm -e "show stat like stop_writes"

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics (2019-01-24 12:24:42 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                              :   node5.domain.com:3000   node6.domain.com:3000   node7.domain.com:3000   
cluster_clock_skew_stop_writes_sec:   0                               0                               0                               

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa Namespace Statistics (2019-01-24 12:24:42 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                  :   node5.domain.com:3000   node6.domain.com:3000   node7.domain.com:3000   
clock_skew_stop_writes:   false                           false                           false                           
stop_writes           :   false                           false                           false                           

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~test Namespace Statistics (2019-01-24 12:24:42 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                  :   node5.domain.com:3000   node6.domain.com:3000   node7.domain.com:3000   
clock_skew_stop_writes:   false                           false                           false                           
stop_writes           :   false                           false                           false   

$ asadm -e "show stat like x_partitions"

Seed:        [('127.0.0.1', 3000, None)]
Config_file: /home/web/.aerospike/astools.conf, /etc/aerospike/astools.conf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa Namespace Statistics (2019-01-24 12:30:01 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                           :   node5.domain.com:3000   node6.domain.com:3000   node7.domain.com:3000   
migrate_rx_partitions_active   :   0                               0                               0                               
migrate_rx_partitions_initial  :   0                               2749                            0                               
migrate_rx_partitions_remaining:   0                               0                               0                               
migrate_tx_partitions_active   :   0                               0                               0                               
migrate_tx_partitions_imbalance:   0                               0                               0                               
migrate_tx_partitions_initial  :   1396                            0                               1353                            
migrate_tx_partitions_remaining:   0                               0                               0

$ asadm -e "show pmap"

Seed:        [('127.0.0.1', 3000, None)]
Config_file: /home/web/.aerospike/astools.conf, /etc/aerospike/astools.conf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Partition Map Analysis (2019-01-24 12:33:39 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     Cluster   Namespace                            Node      Primary    Secondary         Dead   Unavailable   
         Key           .                               .   Partitions   Partitions   Partitions    Partitions   
BEF4A1479187   smpa        node6.domain.com:3000         1382         1367            0             0   
BEF4A1479187   smpa        node7.domain.com:3000         1358         1342            0             0   
BEF4A1479187   smpa        node5.domain.com:3000         1356         1387            0             0   
BEF4A1479187   test        node6.domain.com:3000         1382            0            0             0   
BEF4A1479187   test        node7.domain.com:3000         1358            0            0             0   
BEF4A1479187   test        node5.domain.com:3000         1356            0            0             0   
Number of rows: 6

$ asadm -e "show stat like objects"

Seed:        [('127.0.0.1', 3000, None)]
Config_file: /home/web/.aerospike/astools.conf, /etc/aerospike/astools.conf

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics (2019-01-24 12:34:09 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                       :   node5.domain.com:3000   node6.domain.com:3000   node7.domain.com:3000   
objects                    :   6478039                         6485049                         9265180                         
sindex_gc_objects_validated:   0                               0                               0                               

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa Namespace Statistics (2019-01-24 12:34:09 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                 :   node5.domain.com:3000   node6.domain.com:3000   node7.domain.com:3000   
evicted_objects      :   0                               0                               0                               
expired_objects      :   0                               0                               0                               
master_objects       :   2944752                         3456686                         4712696                         
non_expirable_objects:   2943325                         3455765                         4711880                         
non_replica_objects  :   0                               0                               0                               
objects              :   6478039                         6485049                         9265180                         
prole_objects        :   3533287                         3028363                         4552484                         

$ asadm -e "info"

Seed:        [('127.0.0.1', 3000, None)]
Config_file: /home/web/.aerospike/astools.conf, /etc/aerospike/astools.conf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Information (2019-01-25 06:54:14 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                    Node               Node                    Ip       Build   Cluster   Migrations        Cluster     Cluster         Principal   Client     Uptime   
                                                       .                 Id                     .           .      Size            .            Key   Integrity                 .    Conns          .   
ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   BB9BE0093E32B0A    xx.xxx.xxx.xxx:3000   C-4.2.0.4         3      0.000     3ADA511969DD   True        BB9EAC87115AD0A       59   01:09:24   
ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   *BB9EAC87115AD0A   xx.xxx.xxx.xxx:3000   C-4.2.0.4         3      0.000     3ADA511969DD   True        BB9EAC87115AD0A       59   01:05:17   
ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   BB9D4175485B10A    xx.xxx.xxx.xxx:3000   C-4.2.0.4         3      0.000     3ADA511969DD   True        BB9EAC87115AD0A       59   01:14:17   
Number of rows: 3

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Usage Information (2019-01-25 06:54:14 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace                                                       Node     Total   Expirations,Evictions     Stop       Disk    Disk     HWM   Avail%        Mem     Mem    HWM      Stop   
        .                                                          .   Records                       .   Writes       Used   Used%   Disk%        .       Used   Used%   Mem%   Writes%   
smpa        ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   2.716 M   (0.000,  0.000)         false         N/E   N/E     50      N/E      2.774 GB   24      80     90        
smpa        ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   2.648 M   (0.000,  0.000)         false         N/E   N/E     50      N/E      2.706 GB   23      80     90        
smpa        ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   2.709 M   (0.000,  0.000)         false         N/E   N/E     50      N/E      2.767 GB   24      80     90        
smpa                                                                   8.074 M   (0.000,  0.000)                  0.000 B                             8.247 GB                            
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Object Information (2019-01-25 06:54:14 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace                                                       Node     Total     Repl                       Objects                   Tombstones             Pending   Rack   
        .                                                          .   Records   Factor    (Master,Prole,Non-Replica)   (Master,Prole,Non-Replica)            Migrates     ID   
        .                                                          .         .        .                             .                            .             (tx,rx)      .   
smpa        ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   2.716 M   2        (1.375 M, 1.341 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0      
smpa        ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   2.648 M   2        (1.311 M, 1.337 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0      
smpa        ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   2.709 M   2        (1.351 M, 1.359 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0      
smpa                                                                   8.074 M            (4.037 M, 4.037 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)            

$ asadm -e "show stat like objects"

Seed:        [('127.0.0.1', 3000, None)]
Config_file: /home/web/.aerospike/astools.conf, /etc/aerospike/astools.conf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa d190122 Set Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE   :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
objects:   672400                                                     662491                                                     671131                                                     

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa d190121 Set Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE   :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
objects:   376064                                                     347232                                                     374700                                                     

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa d190124 Set Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE   :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
objects:   629323                                                     617983                                                     628214                                                     

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa d190123 Set Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE   :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
objects:   739556                                                     726447                                                     736871                                                     

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa d190125 Set Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE   :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
objects:   313800                                                     308814                                                     313320                                                     

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                       :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
objects                    :   2731143                                                    2662967                                                    2724236                                                    
sindex_gc_objects_validated:   0                                                          0                                                          0                                                          

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa Namespace Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                 :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
evicted_objects      :   0                                                          0                                                          0                                                          
expired_objects      :   0                                                          0                                                          0                                                          
master_objects       :   1382413                                                    1318579                                                    1358181                                                    
non_expirable_objects:   1382525                                                    1318691                                                    1358445                                                    
non_replica_objects  :   0                                                          0                                                          0                                                          
objects              :   2731143                                                    2662967                                                    2724236                                                    
prole_objects        :   1348730                                                    1344388                                                    1366055                                                    

Solution

  • The issue was that I had provided NATed IPs for heartbeat communication. The "mesh-seed-address-port" entries should use the nodes' private IPs, and "access-address" should be set to the NATed IP if your clients are outside the network. Please go through the threads above if needed.

    Clear documentation on how to configure IP addressing on AWS EC2 instances is here: https://discuss.aerospike.com/t/aws-ec2-ip-addressing-for-aerospike/2424

    Thanks a lot to kporter, pgupta & ashish-shinde for their valuable help.
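
For illustration, a corrected network stanza might look roughly like the sketch below. This is an assumption-laden example rather than the exact configuration that was deployed: the 10.0.0.x addresses are hypothetical placeholders for each node's private (VPC) IP, the masked 13.xxx.xxx.xxx value stands for this node's own NATed IP, and placing "access-address" in the service sub-stanza reflects my understanding of the 4.x configuration layout. All other values are copied from the question.

```
network {
    service {
            address any
            port 3000
            # Assumption: this node's own NATed IP, advertised to clients outside AWS.
            access-address 13.xxx.xxx.xxx
    }

    heartbeat {
            mode mesh
            port 3002 # Heartbeat port for this node.

            # Private IPs of the cluster nodes (hypothetical placeholder values),
            # so heartbeats stay inside the AWS network instead of going through NAT.
            mesh-seed-address-port 10.0.0.11 3002
            mesh-seed-address-port 10.0.0.12 3002
            mesh-seed-address-port 10.0.0.13 3002

            interval 150
            timeout 10
    }

    fabric {
            port 3001
    }

    info {
            port 3003
    }
}
```

The "access-address" value differs per node, since each node advertises its own NATed IP, while the seed list can be the same on all three servers.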