I'm running a Hyperledger Fabric network in an Azure Kubernetes Service (AKS) cluster. I'm using a single Azure Files volume (1000 GB) as my persistent volume.
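For context, the orderer storage is claimed from that Azure Files share roughly like this (a minimal sketch; the claim name, storage class, and exact size are illustrative, not my real manifest):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fabric-pvc              # assumed claim name
spec:
  accessModes:
    - ReadWriteMany             # Azure Files allows shared access across PODs
  storageClassName: azurefile   # assumed AKS storage class backed by Azure Files
  resources:
    requests:
      storage: 1000Gi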
However, my Orderer POD keeps restarting over and over again.
The Orderer POD is logging the following error:
2022-02-13 04:40:22.342 UTC 0080 PANI [orderer.consensus.etcdraft] commitTo -> tocommit(8) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost? channel=system-channel node=3
panic: tocommit(8) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost?
Following are the detailed logs from the Orderer POD:
2022-02-13 04:40:22.342 UTC 007f INFO [orderer.consensus.etcdraft] becomeFollower -> 3 became follower at term 2 channel=system-channel node=3
2022-02-13 04:40:22.342 UTC 0080 PANI [orderer.consensus.etcdraft] commitTo -> tocommit(8) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost? channel=system-channel node=3
panic: tocommit(8) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost?
go.uber.org/zap.(*SugaredLogger).log(0xc000332e20, 0xf0000000000004, 0x10b467e, 0x5d, 0xc000533b00, 0x2, 0x2, 0x0, 0x0, 0x0)
/go/src/github.com/hyperledger/fabric/vendor/go.uber.org/zap/sugar.go:234 +0xf6
go.uber.org/zap.(*SugaredLogger).Panicf(...)
/go/src/github.com/hyperledger/fabric/vendor/go.uber.org/zap/sugar.go:159
github.com/hyperledger/fabric/common/flogging.(*FabricLogger).Panicf(0xc000332e28, 0x10b467e, 0x5d, 0xc000533b00, 0x2, 0x2)
/go/src/github.com/hyperledger/fabric/common/flogging/zap.go:74 +0x7c
go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc0001a4310, 0x8)
/go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/log.go:203 +0x135
go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc000aaab40, 0x8, 0x3, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/raft.go:1324 +0x54
go.etcd.io/etcd/raft.stepFollower(0xc000aaab40, 0x8, 0x3, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/raft.go:1269 +0x439
go.etcd.io/etcd/raft.(*raft).Step(0xc000aaab40, 0x8, 0x3, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/raft.go:971 +0x1235
go.etcd.io/etcd/raft.(*node).run(0xc00007e660, 0xc000aaab40)
/go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/node.go:357 +0xd78
created by go.etcd.io/etcd/raft.StartNode
/go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/node.go:233 +0x409
2022-02-13 04:45:28.330 UTC 0001 WARN [localconfig] completeInitialization -> General.GenesisFile should be replaced by General.BootstrapFile
2022-02-13 04:45:28.331 UTC 0002 INFO [localconfig] completeInitialization -> Kafka.Version unset, setting to 0.10.2.0
2022-02-13 04:45:28.331 UTC 0003 INFO [orderer.common.server] prettyPrintStruct -> Orderer config values:
General.ListenAddress = "0.0.0.0"
General.ListenPort = 7050
General.TLS.Enabled = true
General.TLS.PrivateKey = "/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/server.key"
General.TLS.Certificate = "/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/server.crt"
General.TLS.RootCAs = [/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/ca.crt]
General.TLS.ClientAuthRequired = false
General.TLS.ClientRootCAs = []
General.TLS.TLSHandshakeTimeShift = 0s
General.Cluster.ListenAddress = ""
General.Cluster.ListenPort = 0
General.Cluster.ServerCertificate = ""
General.Cluster.ServerPrivateKey = ""
General.Cluster.ClientCertificate = "/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/server.crt"
General.Cluster.ClientPrivateKey = "/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/server.key"
General.Cluster.RootCAs = []
General.Cluster.DialTimeout = 5s
General.Cluster.RPCTimeout = 7s
General.Cluster.ReplicationBufferSize = 20971520
General.Cluster.ReplicationPullTimeout = 5s
General.Cluster.ReplicationRetryTimeout = 5s
General.Cluster.ReplicationBackgroundRefreshInterval = 5m0s
General.Cluster.ReplicationMaxRetries = 12
General.Cluster.SendBufferSize = 10
General.Cluster.CertExpirationWarningThreshold = 168h0m0s
General.Cluster.TLSHandshakeTimeShift = 0s
General.Keepalive.ServerMinInterval = 1m0s
General.Keepalive.ServerInterval = 2h0m0s
General.Keepalive.ServerTimeout = 20s
General.ConnectionTimeout = 0s
General.GenesisMethod = "file"
General.GenesisFile = "/system-genesis-block/genesis.block"
Kafka.Retry.LongTotal = 12h0m0s
Kafka.Retry.NetworkTimeouts.DialTimeout = 10s
Kafka.Retry.NetworkTimeouts.ReadTimeout = 10s
Kafka.Retry.NetworkTimeouts.WriteTimeout = 10s
Kafka.Retry.Metadata.RetryMax = 3
Kafka.Retry.Metadata.RetryBackoff = 250ms
Kafka.Retry.Producer.RetryMax = 3
Kafka.Retry.Producer.RetryBackoff = 100ms
Kafka.Retry.Consumer.RetryBackoff = 2s
Kafka.Verbose = false
Kafka.Version = 0.10.2.0
Kafka.TLS.Enabled = false
Kafka.TLS.PrivateKey = ""
Kafka.TLS.Certificate = ""
Kafka.TLS.RootCAs = []
Kafka.TLS.ClientAuthRequired = false
Kafka.TLS.ClientRootCAs = []
Kafka.TLS.TLSHandshakeTimeShift = 0s
Kafka.SASLPlain.Enabled = false
Kafka.SASLPlain.User = ""
Kafka.SASLPlain.Password = ""
Kafka.Topic.ReplicationFactor = 3
Debug.BroadcastTraceDir = ""
Debug.DeliverTraceDir = ""
Consensus = map[SnapDir:/var/hyperledger/production/orderer/etcdraft/snapshot WALDir:/var/hyperledger/production/orderer/etcdraft/wal]
Operations.ListenAddress = "127.0.0.1:8443"
Operations.TLS.Enabled = false
Operations.TLS.PrivateKey = ""
Operations.TLS.Certificate = ""
Operations.TLS.RootCAs = []
Operations.TLS.ClientAuthRequired = false
Operations.TLS.ClientRootCAs = []
Operations.TLS.TLSHandshakeTimeShift = 0s
Metrics.Provider = "disabled"
Metrics.Statsd.Network = "udp"
Metrics.Statsd.Address = "127.0.0.1:8125"
Metrics.Statsd.WriteInterval = 30s
Metrics.Statsd.Prefix = ""
ChannelParticipation.Enabled = false
ChannelParticipation.MaxRequestBodySize = 1048576
Admin.ListenAddress = "127.0.0.1:9443"
Admin.TLS.Enabled = false
Admin.TLS.PrivateKey = ""
Admin.TLS.Certificate = ""
Admin.TLS.RootCAs = []
Admin.TLS.ClientAuthRequired = true
Admin.TLS.ClientRootCAs = []
Admin.TLS.TLSHandshakeTimeShift = 0s
2022-02-13 04:45:28.773 UTC 0004 INFO [orderer.common.server] initializeServerConfig -> Starting orderer with TLS enabled
2022-02-13 04:45:28.822 UTC 0005 INFO [blkstorage] NewProvider -> Creating new file ledger directory at /var/hyperledger/production/orderer/chains
2022-02-13 04:45:28.870 UTC 0006 INFO [orderer.common.server] Main -> Bootstrapping the system channel
2022-02-13 04:45:28.880 UTC 0007 INFO [blkstorage] newBlockfileMgr -> Getting block information from block storage
2022-02-13 04:45:28.920 UTC 0008 INFO [orderer.common.server] initializeBootstrapChannel -> Initialized the system channel 'system-channel' from bootstrap block
2022-02-13 04:45:28.923 UTC 0009 INFO [orderer.common.server] extractSystemChannel -> Found system channel config block, number: 0
2022-02-13 04:45:28.923 UTC 000a INFO [orderer.common.server] selectClusterBootBlock -> Cluster boot block is bootstrap (genesis) block; Blocks Header.Number system-channel=0, bootstrap=0
2022-02-13 04:45:28.926 UTC 000b INFO [orderer.common.server] Main -> Starting with system channel: system-channel, consensus type: etcdraft
2022-02-13 04:45:28.926 UTC 000c INFO [orderer.common.server] Main -> Setting up cluster
2022-02-13 04:45:28.926 UTC 000d INFO [orderer.common.server] reuseListener -> Cluster listener is not configured, defaulting to use the general listener on port 7050
2022-02-13 04:45:28.959 UTC 000e INFO [orderer.common.cluster] loadVerifier -> Loaded verifier for channel system-channel from config block at index 0
2022-02-13 04:45:28.960 UTC 000f INFO [certmonitor] trackCertExpiration -> The enrollment certificate will expire on 2023-02-10 09:12:00 +0000 UTC
2022-02-13 04:45:28.960 UTC 0010 INFO [certmonitor] trackCertExpiration -> The server TLS certificate will expire on 2023-02-10 09:13:00 +0000 UTC
2022-02-13 04:45:28.960 UTC 0011 INFO [certmonitor] trackCertExpiration -> The client TLS certificate will expire on 2023-02-10 09:13:00 +0000 UTC
2022-02-13 04:45:28.966 UTC 0012 INFO [orderer.consensus.etcdraft] HandleChain -> EvictionSuspicion not set, defaulting to 10m0s
2022-02-13 04:45:28.966 UTC 0013 INFO [orderer.consensus.etcdraft] HandleChain -> With system channel: after eviction InactiveChainRegistry.TrackChain will be called
2022-02-13 04:45:28.966 UTC 0014 INFO [orderer.consensus.etcdraft] createOrReadWAL -> No WAL data found, creating new WAL at path '/var/hyperledger/production/orderer/etcdraft/wal/system-channel' channel=system-channel node=3
2022-02-13 04:45:28.983 UTC 0015 INFO [orderer.commmon.multichannel] initSystemChannel -> Starting system channel 'system-channel' with genesis block hash ac270210ec2258b99948adc06552a14d49463c3457933b1e24a151502c6487e5 and orderer type etcdraft
2022-02-13 04:45:28.983 UTC 0016 INFO [orderer.consensus.etcdraft] Start -> Starting Raft node channel=system-channel node=3
2022-02-13 04:45:28.983 UTC 0017 INFO [orderer.common.cluster] Configure -> Entering, channel: system-channel, nodes: [ID: 1,
Are there any config values or environment variables we can set in configtx.yaml or orderer-deployment.yaml to avoid this problem?
Are there any timeouts we can increase or set in configtx.yaml or orderer-deployment.yaml to avoid this problem?
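The only Raft timing knobs I'm aware of live under the etcdraft Options block in configtx.yaml; a minimal sketch with what I believe are the sample defaults (values shown for illustration only, and I'm not sure any of them address this panic):

Orderer:
  OrdererType: etcdraft
  EtcdRaft:
    Options:
      TickInterval: 500ms        # duration of a single Raft tick
      ElectionTick: 10           # ticks without leader contact before an election
      HeartbeatTick: 1           # ticks between leader heartbeats
      MaxInflightBlocks: 5
      SnapshotIntervalSize: 16 MB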
One more observation: when we manually delete one or two Orderer PODs and Kubernetes automatically spins up replacement PODs, the new PODs keep restarting with the same error.
Turns out my WAL directory had been deleted. For anyone landing on this question: set the following ENV variables on your Orderer deployments (if they are not already set):
- name: ORDERER_CONSENSUS_SNAPDIR
  value: /var/hyperledger/production/orderer/etcdraft/snapshot
- name: ORDERER_CONSENSUS_WALDIR
  value: /var/hyperledger/production/orderer/etcdraft/wal
If you already have the above ENV variables set on your Orderer deployments, then check whether the directories at those locations are empty for the restarting Orderers. If they are empty, it means your WAL logs were deleted. The Orderer cannot recover if its WAL logs are not available.
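Those directories only survive POD restarts if /var/hyperledger/production/orderer sits on the persistent volume. A minimal sketch of the relevant part of orderer-deployment.yaml (container, volume, claim, and subPath names are assumptions):

containers:
  - name: orderer                                  # assumed container name
    volumeMounts:
      - name: fabric-storage                       # assumed volume name
        mountPath: /var/hyperledger/production/orderer
        subPath: orderers/orderer3.example.com     # assumed per-orderer subPath on the shared volume
volumes:
  - name: fabric-storage
    persistentVolumeClaim:
      claimName: fabric-pvc                        # assumed PVC bound to the Azure Files volume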