Tags: azure, kubernetes, hyperledger-fabric, raft

Hyperledger Fabric | Orderer PODs keep restarting | Error: Was the raft log corrupted, truncated, or lost?


I'm running a Hyperledger Fabric network in an Azure Kubernetes Service (AKS) cluster, using a single Azure Files volume (1000 GB) as the persistent volume.
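
For context, the orderer data directory is mounted from that Azure Files share. A minimal sketch of the wiring; the claim name and storage class here are illustrative assumptions, not the exact values from my manifests:

    # Hypothetical PersistentVolumeClaim backed by Azure Files (names are illustrative)
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: fabric-azurefile-pvc
    spec:
      accessModes: ["ReadWriteMany"]
      storageClassName: azurefile
      resources:
        requests:
          storage: 1000Gi

The Orderer deployment then mounts such a claim at /var/hyperledger/production/orderer, the default orderer data directory, matching the WALDir/SnapDir paths visible in the config dump below.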

However, my Orderer POD keeps restarting over and over.

The Orderer POD logs the following error:

2022-02-13 04:40:22.342 UTC 0080 PANI [orderer.consensus.etcdraft] commitTo -> tocommit(8) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost? channel=system-channel node=3
panic: tocommit(8) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost?

Below are the detailed logs from the Orderer POD:

2022-02-13 04:40:22.342 UTC 007f INFO [orderer.consensus.etcdraft] becomeFollower -> 3 became follower at term 2 channel=system-channel node=3
2022-02-13 04:40:22.342 UTC 0080 PANI [orderer.consensus.etcdraft] commitTo -> tocommit(8) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost? channel=system-channel node=3
panic: tocommit(8) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost?
 
go.uber.org/zap.(*SugaredLogger).log(0xc000332e20, 0xf0000000000004, 0x10b467e, 0x5d, 0xc000533b00, 0x2, 0x2, 0x0, 0x0, 0x0)
    /go/src/github.com/hyperledger/fabric/vendor/go.uber.org/zap/sugar.go:234 +0xf6
go.uber.org/zap.(*SugaredLogger).Panicf(...)
    /go/src/github.com/hyperledger/fabric/vendor/go.uber.org/zap/sugar.go:159
github.com/hyperledger/fabric/common/flogging.(*FabricLogger).Panicf(0xc000332e28, 0x10b467e, 0x5d, 0xc000533b00, 0x2, 0x2)
    /go/src/github.com/hyperledger/fabric/common/flogging/zap.go:74 +0x7c
go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc0001a4310, 0x8)
    /go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/log.go:203 +0x135
go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc000aaab40, 0x8, 0x3, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/raft.go:1324 +0x54
go.etcd.io/etcd/raft.stepFollower(0xc000aaab40, 0x8, 0x3, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/raft.go:1269 +0x439
go.etcd.io/etcd/raft.(*raft).Step(0xc000aaab40, 0x8, 0x3, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/raft.go:971 +0x1235
go.etcd.io/etcd/raft.(*node).run(0xc00007e660, 0xc000aaab40)
    /go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/node.go:357 +0xd78
created by go.etcd.io/etcd/raft.StartNode
    /go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/node.go:233 +0x409
2022-02-13 04:45:28.330 UTC 0001 WARN [localconfig] completeInitialization -> General.GenesisFile should be replaced by General.BootstrapFile
2022-02-13 04:45:28.331 UTC 0002 INFO [localconfig] completeInitialization -> Kafka.Version unset, setting to 0.10.2.0
2022-02-13 04:45:28.331 UTC 0003 INFO [orderer.common.server] prettyPrintStruct -> Orderer config values:
    General.ListenAddress = "0.0.0.0"
    General.ListenPort = 7050
    General.TLS.Enabled = true
    General.TLS.PrivateKey = "/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/server.key"
    General.TLS.Certificate = "/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/server.crt"
    General.TLS.RootCAs = [/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/ca.crt]
    General.TLS.ClientAuthRequired = false
    General.TLS.ClientRootCAs = []
    General.TLS.TLSHandshakeTimeShift = 0s
    General.Cluster.ListenAddress = ""
    General.Cluster.ListenPort = 0
    General.Cluster.ServerCertificate = ""
    General.Cluster.ServerPrivateKey = ""
    General.Cluster.ClientCertificate = "/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/server.crt"
    General.Cluster.ClientPrivateKey = "/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/server.key"
    General.Cluster.RootCAs = []
    General.Cluster.DialTimeout = 5s
    General.Cluster.RPCTimeout = 7s
    General.Cluster.ReplicationBufferSize = 20971520
    General.Cluster.ReplicationPullTimeout = 5s
    General.Cluster.ReplicationRetryTimeout = 5s
    General.Cluster.ReplicationBackgroundRefreshInterval = 5m0s
    General.Cluster.ReplicationMaxRetries = 12
    General.Cluster.SendBufferSize = 10
    General.Cluster.CertExpirationWarningThreshold = 168h0m0s
    General.Cluster.TLSHandshakeTimeShift = 0s
    General.Keepalive.ServerMinInterval = 1m0s
    General.Keepalive.ServerInterval = 2h0m0s
    General.Keepalive.ServerTimeout = 20s
    General.ConnectionTimeout = 0s
    General.GenesisMethod = "file"
    General.GenesisFile = "/system-genesis-block/genesis.block"
    Kafka.Retry.LongTotal = 12h0m0s
    Kafka.Retry.NetworkTimeouts.DialTimeout = 10s
    Kafka.Retry.NetworkTimeouts.ReadTimeout = 10s
    Kafka.Retry.NetworkTimeouts.WriteTimeout = 10s
    Kafka.Retry.Metadata.RetryMax = 3
    Kafka.Retry.Metadata.RetryBackoff = 250ms
    Kafka.Retry.Producer.RetryMax = 3
    Kafka.Retry.Producer.RetryBackoff = 100ms
    Kafka.Retry.Consumer.RetryBackoff = 2s
    Kafka.Verbose = false
    Kafka.Version = 0.10.2.0
    Kafka.TLS.Enabled = false
    Kafka.TLS.PrivateKey = ""
    Kafka.TLS.Certificate = ""
    Kafka.TLS.RootCAs = []
    Kafka.TLS.ClientAuthRequired = false
    Kafka.TLS.ClientRootCAs = []
    Kafka.TLS.TLSHandshakeTimeShift = 0s
    Kafka.SASLPlain.Enabled = false
    Kafka.SASLPlain.User = ""
    Kafka.SASLPlain.Password = ""
    Kafka.Topic.ReplicationFactor = 3
    Debug.BroadcastTraceDir = ""
    Debug.DeliverTraceDir = ""
    Consensus = map[SnapDir:/var/hyperledger/production/orderer/etcdraft/snapshot WALDir:/var/hyperledger/production/orderer/etcdraft/wal]
    Operations.ListenAddress = "127.0.0.1:8443"
    Operations.TLS.Enabled = false
    Operations.TLS.PrivateKey = ""
    Operations.TLS.Certificate = ""
    Operations.TLS.RootCAs = []
    Operations.TLS.ClientAuthRequired = false
    Operations.TLS.ClientRootCAs = []
    Operations.TLS.TLSHandshakeTimeShift = 0s
    Metrics.Provider = "disabled"
    Metrics.Statsd.Network = "udp"
    Metrics.Statsd.Address = "127.0.0.1:8125"
    Metrics.Statsd.WriteInterval = 30s
    Metrics.Statsd.Prefix = ""
    ChannelParticipation.Enabled = false
    ChannelParticipation.MaxRequestBodySize = 1048576
    Admin.ListenAddress = "127.0.0.1:9443"
    Admin.TLS.Enabled = false
    Admin.TLS.PrivateKey = ""
    Admin.TLS.Certificate = ""
    Admin.TLS.RootCAs = []
    Admin.TLS.ClientAuthRequired = true
    Admin.TLS.ClientRootCAs = []
    Admin.TLS.TLSHandshakeTimeShift = 0s
2022-02-13 04:45:28.773 UTC 0004 INFO [orderer.common.server] initializeServerConfig -> Starting orderer with TLS enabled
2022-02-13 04:45:28.822 UTC 0005 INFO [blkstorage] NewProvider -> Creating new file ledger directory at /var/hyperledger/production/orderer/chains
2022-02-13 04:45:28.870 UTC 0006 INFO [orderer.common.server] Main -> Bootstrapping the system channel
2022-02-13 04:45:28.880 UTC 0007 INFO [blkstorage] newBlockfileMgr -> Getting block information from block storage
2022-02-13 04:45:28.920 UTC 0008 INFO [orderer.common.server] initializeBootstrapChannel -> Initialized the system channel 'system-channel' from bootstrap block
2022-02-13 04:45:28.923 UTC 0009 INFO [orderer.common.server] extractSystemChannel -> Found system channel config block, number: 0
2022-02-13 04:45:28.923 UTC 000a INFO [orderer.common.server] selectClusterBootBlock -> Cluster boot block is bootstrap (genesis) block; Blocks Header.Number system-channel=0, bootstrap=0
2022-02-13 04:45:28.926 UTC 000b INFO [orderer.common.server] Main -> Starting with system channel: system-channel, consensus type: etcdraft
2022-02-13 04:45:28.926 UTC 000c INFO [orderer.common.server] Main -> Setting up cluster
2022-02-13 04:45:28.926 UTC 000d INFO [orderer.common.server] reuseListener -> Cluster listener is not configured, defaulting to use the general listener on port 7050
2022-02-13 04:45:28.959 UTC 000e INFO [orderer.common.cluster] loadVerifier -> Loaded verifier for channel system-channel from config block at index 0
2022-02-13 04:45:28.960 UTC 000f INFO [certmonitor] trackCertExpiration -> The enrollment certificate will expire on 2023-02-10 09:12:00 +0000 UTC
2022-02-13 04:45:28.960 UTC 0010 INFO [certmonitor] trackCertExpiration -> The server TLS certificate will expire on 2023-02-10 09:13:00 +0000 UTC
2022-02-13 04:45:28.960 UTC 0011 INFO [certmonitor] trackCertExpiration -> The client TLS certificate will expire on 2023-02-10 09:13:00 +0000 UTC
2022-02-13 04:45:28.966 UTC 0012 INFO [orderer.consensus.etcdraft] HandleChain -> EvictionSuspicion not set, defaulting to 10m0s
2022-02-13 04:45:28.966 UTC 0013 INFO [orderer.consensus.etcdraft] HandleChain -> With system channel: after eviction InactiveChainRegistry.TrackChain will be called
2022-02-13 04:45:28.966 UTC 0014 INFO [orderer.consensus.etcdraft] createOrReadWAL -> No WAL data found, creating new WAL at path '/var/hyperledger/production/orderer/etcdraft/wal/system-channel' channel=system-channel node=3
2022-02-13 04:45:28.983 UTC 0015 INFO [orderer.commmon.multichannel] initSystemChannel -> Starting system channel 'system-channel' with genesis block hash ac270210ec2258b99948adc06552a14d49463c3457933b1e24a151502c6487e5 and orderer type etcdraft
2022-02-13 04:45:28.983 UTC 0016 INFO [orderer.consensus.etcdraft] Start -> Starting Raft node channel=system-channel node=3
2022-02-13 04:45:28.983 UTC 0017 INFO [orderer.common.cluster] Configure -> Entering, channel: system-channel, nodes: [ID: 1,
My questions are:

  1. Are there any config/env vars we can set in configtx.yaml or orderer-deployment.yaml to avoid this problem?

  2. Are there any timeouts etc. we can increase/set in configtx.yaml or orderer-deployment.yaml to avoid this problem? (See the configtx.yaml sketch after this list for where such options live.)

  3. One more observation: when we manually delete one or two Orderer PODs and new ones spin up automatically in their place, they keep restarting with the same error.
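
For reference, the etcdraft tuning knobs live under Orderer > EtcdRaft > Options in configtx.yaml. A sketch with the Fabric sample defaults; these are the stock values, not the ones from my network:

    # configtx.yaml (excerpt) -- etcdraft consenter options; values shown are the
    # Fabric sample defaults, not taken from my network
    Orderer:
      OrdererType: etcdraft
      EtcdRaft:
        Options:
          TickInterval: 500ms          # time between two raft ticks
          ElectionTick: 10             # ticks a follower waits without a leader before starting an election
          HeartbeatTick: 1             # ticks between heartbeats sent by the leader
          MaxInflightBlocks: 5         # max in-flight append messages during optimistic replication
          SnapshotIntervalSize: 16 MB  # bytes written between two snapshots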


Solution

  • It turns out my WAL log directory had been deleted. For anyone landing on this question: set the following ENV variables on your Orderer deployments (if they are not already set):

    - name: ORDERER_CONSENSUS_SNAPDIR
      value: /var/hyperledger/production/orderer/etcdraft/snapshot
    - name: ORDERER_CONSENSUS_WALDIR
      value: /var/hyperledger/production/orderer/etcdraft/wal
    

    If you already have the above ENV variables set on your Orderer deployments, check whether those directories are empty for the restarting Orderers. If they are empty, your WAL logs were deleted, and an Orderer cannot recover without its WAL logs.
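
    To make those paths actually survive restarts, they have to sit on the persistent volume. A minimal sketch of the relevant part of an orderer deployment; the volume and claim names are illustrative, not taken from my manifests:

    # orderer-deployment.yaml (excerpt) -- WAL and snapshot dirs must be on persistent storage
    containers:
      - name: orderer
        env:
          - name: ORDERER_CONSENSUS_WALDIR
            value: /var/hyperledger/production/orderer/etcdraft/wal
          - name: ORDERER_CONSENSUS_SNAPDIR
            value: /var/hyperledger/production/orderer/etcdraft/snapshot
        volumeMounts:
          - name: orderer-data                 # illustrative name
            mountPath: /var/hyperledger/production/orderer
    volumes:
      - name: orderer-data
        persistentVolumeClaim:
          claimName: fabric-azurefile-pvc      # illustrative claim name

    With the data directory mounted from the claim, the WAL and snapshot files persist across POD restarts instead of living on the container's ephemeral filesystem.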