Hi,
I have a 3-node Event Store (ES) cluster in Azure. The master node was spiking above 80% CPU, with the EventStore process's usage being particularly high. CPU was hovering at around 20% on the other nodes, so I decided to simply reboot the master node.
After its reboot and service restart, the cluster is not coming online. Looking at the log file on the rebooted server, I’m seeing a lot of these errors:
[PID:06176:005 2017.12.12 10:08:33.970 DEBUG HttpEntityManager ] Error during setting content length on HTTP response: This operation cannot be performed after the response has been submitted…
[PID:06176:005 2017.12.12 10:08:34.970 DEBUG HttpEntityManager ] Error during setting content length on HTTP response: This operation cannot be performed after the response has been submitted…
[PID:06176:013 2017.12.12 10:08:35.126 DEBUG GossipController ] Error while reading request (gossip): The I/O operation has been aborted because of either a thread exit or an application request
[PID:06176:013 2017.12.12 10:08:36.345 DEBUG HttpEntityManager ] Close connection error (after crash in read request): An operation was attempted on a nonexistent network connection
[PID:06176:013 2017.12.12 10:08:36.345 DEBUG GossipController ] Error while reading request (gossip): The I/O operation has been aborted because of either a thread exit or an application request
.
.
.
[PID:06176:004 2017.12.12 10:09:12.143 DEBUG HttpEntityManager ] Close connection error (after crash in read request): The parameter is incorrect
[PID:06176:004 2017.12.12 10:09:12.143 DEBUG GossipController ] Error while reading request (gossip): The I/O operation has been aborted because of either a thread exit or an application request
[PID:06176:012 2017.12.12 10:09:12.533 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 330250 records (41.1%).
[PID:06176:006 2017.12.12 10:09:12.611 DEBUG HttpEntityManager ] Error during setting content length on HTTP response: This operation cannot be performed after the response has been submitted…
[PID:06176:004 2017.12.12 10:09:13.361 DEBUG HttpEntityManager ] Close connection error (after crash in read request): The parameter is incorrect
[PID:06176:004 2017.12.12 10:09:13.361 DEBUG GossipController ] Error while reading request (gossip): The I/O operation has been aborted because of either a thread exit or an application request
[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] CLUSTER HAS CHANGED (gossip received from [10.23.64.18:2113])
[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] Old:
[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] VND {6f92cfca-9310-43c9-9928-dd5401f8371b} [Slave, 10.23.64.18:1113, n/a, 10.23.64.18:1112, n/a, 10.23.64.18:2113, 10.23.64.18:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:10.349
[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] VND {2ad01055-f180-46dd-9ac3-2db34524ba62} [Master, 10.23.64.17:1113, n/a, 10.23.64.17:1112, n/a, 10.23.64.17:2113, 10.23.64.17:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:12.127
[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] VND {9ba44d5b-f21f-4d42-87d7-87e994e5d689} [Initializing, 10.23.64.15:1113, 10.23.64.15:0, 10.23.64.15:1112, 10.23.64.15:0, 10.23.64.15:2113, 10.23.64.15:2114] 4103601075/9975646901/9975646901/E10839@9975596515:{1094ed09-6399-40e9-a92c-017322f6d1d8} | 2017-12-12 10:09:13.611
[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] New:
[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] VND {6f92cfca-9310-43c9-9928-dd5401f8371b} [Slave, 10.23.64.18:1113, n/a, 10.23.64.18:1112, n/a, 10.23.64.18:2113, 10.23.64.18:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:13.589
[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] VND {2ad01055-f180-46dd-9ac3-2db34524ba62} [Master, 10.23.64.17:1113, n/a, 10.23.64.17:1112, n/a, 10.23.64.17:2113, 10.23.64.17:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:13.589
[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] VND {9ba44d5b-f21f-4d42-87d7-87e994e5d689} [Initializing, 10.23.64.15:1113, 10.23.64.15:0, 10.23.64.15:1112, 10.23.64.15:0, 10.23.64.15:2113, 10.23.64.15:2114] 4103601075/9975646901/9975646901/E10839@9975596515:{1094ed09-6399-40e9-a92c-017322f6d1d8} | 2017-12-12 10:09:13.611
[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] --------------------------------------------------------------------------------
[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] Looks like node [10.23.64.17:2113] is DEAD (Gossip send failed).
[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] CLUSTER HAS CHANGED (gossip send failed to [10.23.64.17:2113])
[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] Old:
[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] VND {6f92cfca-9310-43c9-9928-dd5401f8371b} <LIVE> [Slave, 10.23.64.18:1113, n/a, 10.23.64.18:1112, n/a, 10.23.64.18:2113, 10.23.64.18:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:14.381
[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] VND {2ad01055-f180-46dd-9ac3-2db34524ba62} <LIVE> [Master, 10.23.64.17:1113, n/a, 10.23.64.17:1112, n/a, 10.23.64.17:2113, 10.23.64.17:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:13.589
[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] VND {9ba44d5b-f21f-4d42-87d7-87e994e5d689} <LIVE> [Initializing, 10.23.64.15:1113, 10.23.64.15:0, 10.23.64.15:1112, 10.23.64.15:0, 10.23.64.15:2113, 10.23.64.15:2114] 4107207404/9975646901/9975646901/E10839@9975596515:{1094ed09-6399-40e9-a92c-017322f6d1d8} | 2017-12-12 10:09:14.621
[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] New:
[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] VND {6f92cfca-9310-43c9-9928-dd5401f8371b} <LIVE> [Slave, 10.23.64.18:1113, n/a, 10.23.64.18:1112, n/a, 10.23.64.18:2113, 10.23.64.18:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:14.381
[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] VND {2ad01055-f180-46dd-9ac3-2db34524ba62} <DEAD> [Master, 10.23.64.17:1113, n/a, 10.23.64.17:1112, n/a, 10.23.64.17:2113, 10.23.64.17:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:15.136
[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] VND {9ba44d5b-f21f-4d42-87d7-87e994e5d689} <LIVE> [Initializing, 10.23.64.15:1113, 10.23.64.15:0, 10.23.64.15:1112, 10.23.64.15:0, 10.23.64.15:2113, 10.23.64.15:2114] 4107207404/9975646901/9975646901/E10839@9975596515:{1094ed09-6399-40e9-a92c-017322f6d1d8} | 2017-12-12 10:09:14.621
[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] --------------------------------------------------------------------------------
[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] CLUSTER HAS CHANGED (gossip received from [10.23.64.18:2113])
[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] Old:
[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] VND {6f92cfca-9310-43c9-9928-dd5401f8371b} <LIVE> [Slave, 10.23.64.18:1113, n/a, 10.23.64.18:1112, n/a, 10.23.64.18:2113, 10.23.64.18:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:15.393
[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] VND {2ad01055-f180-46dd-9ac3-2db34524ba62} <DEAD> [Master, 10.23.64.17:1113, n/a, 10.23.64.17:1112, n/a, 10.23.64.17:2113, 10.23.64.17:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:15.136
[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] VND {9ba44d5b-f21f-4d42-87d7-87e994e5d689} <LIVE> [Initializing, 10.23.64.15:1113, 10.23.64.15:0, 10.23.64.15:1112, 10.23.64.15:0, 10.23.64.15:2113, 10.23.64.15:2114] 4110597233/9975646901/9975646901/E10839@9975596515:{1094ed09-6399-40e9-a92c-017322f6d1d8} | 2017-12-12 10:09:15.621
[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] New:
[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] VND {6f92cfca-9310-43c9-9928-dd5401f8371b} <LIVE> [Slave, 10.23.64.18:1113, n/a, 10.23.64.18:1112, n/a, 10.23.64.18:2113, 10.23.64.18:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:15.612
[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] VND {2ad01055-f180-46dd-9ac3-2db34524ba62} <LIVE> [Master, 10.23.64.17:1113, n/a, 10.23.64.17:1112, n/a, 10.23.64.17:2113, 10.23.64.17:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:15.393
[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] VND {9ba44d5b-f21f-4d42-87d7-87e994e5d689} <LIVE> [Initializing, 10.23.64.15:1113, 10.23.64.15:0, 10.23.64.15:1112, 10.23.64.15:0, 10.23.64.15:2113, 10.23.64.15:2114] 4110597233/9975646901/9975646901/E10839@9975596515:{1094ed09-6399-40e9-a92c-017322f6d1d8} | 2017-12-12 10:09:15.621
Any idea how to bring this node back online? What’s causing the underlying issue? And what is causing the status of the master (10.23.64.17) to cycle between LIVE and DEAD?
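In case it helps anyone reading the full log, here is a rough Python sketch for pulling the LIVE/DEAD flips out of the gossip output. It is not part of Event Store: the names liveness_changes, VND_RE and TIME_RE are made up for illustration, and the line format is only inferred from the excerpt above, so treat it as an assumption rather than a documented format.

```python
import re

# Node id, optional <LIVE>/<DEAD> marker and role from a gossip "VND" line.
# Assumption: the layout is inferred from the log excerpt in this post only.
VND_RE = re.compile(
    r"VND \{(?P<node>[0-9a-f-]+)\} (?:<(?P<alive>LIVE|DEAD)> )?\[(?P<role>\w+),"
)
# Time of day from the log prefix (dotted date), not the trailing gossip stamp.
TIME_RE = re.compile(r"\d{4}\.\d{2}\.\d{2} (\d{2}:\d{2}:\d{2}\.\d{3})")


def liveness_changes(lines):
    """Yield (time, node_id, role, old, new) each time a node's marker flips.

    Only entries listed under a "New:" header are considered, so every
    "CLUSTER HAS CHANGED" block contributes at most one state per node.
    """
    last_seen = {}   # node id -> last LIVE/DEAD marker observed
    section = None   # "old" or "new", depending on the last header seen
    for line in lines:
        stripped = line.rstrip()
        if stripped.endswith(" Old:"):
            section = "old"
            continue
        if stripped.endswith(" New:"):
            section = "new"
            continue
        if section != "new":
            continue
        match = VND_RE.search(line)
        if match is None or match.group("alive") is None:
            continue  # not a node entry, or an entry without a marker
        node, alive = match.group("node"), match.group("alive")
        previous = last_seen.get(node)
        if previous is not None and previous != alive:
            stamp = TIME_RE.search(line)
            yield (
                stamp.group(1) if stamp else None,
                node,
                match.group("role"),
                previous,
                alive,
            )
        last_seen[node] = alive
```

Something like `for change in liveness_changes(open("node.log", encoding="utf-8")): print(change)` would then list each flip, where node.log is a placeholder for the actual log file path.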
Thanks