EventStore Cluster Fails After Node Reboot

Hi,

I have a 3-node ES cluster in Azure. The master node was spiking above 80% CPU, with the EventStore process accounting for most of it, while the other two nodes were hovering at around 20%, so I decided to simply reboot the master node.

After its reboot and service restart, the cluster is not coming back online. Looking at the log file on the rebooted server, I’m seeing a lot of these errors:

[PID:06176:005 2017.12.12 10:08:33.970 DEBUG HttpEntityManager ] Error during setting content length on HTTP response: This operation cannot be performed after the response has been submitted…

[PID:06176:005 2017.12.12 10:08:34.970 DEBUG HttpEntityManager ] Error during setting content length on HTTP response: This operation cannot be performed after the response has been submitted…

[PID:06176:013 2017.12.12 10:08:35.126 DEBUG GossipController ] Error while reading request (gossip): The I/O operation has been aborted because of either a thread exit or an application request

[PID:06176:013 2017.12.12 10:08:36.345 DEBUG HttpEntityManager ] Close connection error (after crash in read request): An operation was attempted on a nonexistent network connection

[PID:06176:013 2017.12.12 10:08:36.345 DEBUG GossipController ] Error while reading request (gossip): The I/O operation has been aborted because of either a thread exit or an application request

...

[PID:06176:004 2017.12.12 10:09:12.143 DEBUG HttpEntityManager ] Close connection error (after crash in read request): The parameter is incorrect

[PID:06176:004 2017.12.12 10:09:12.143 DEBUG GossipController ] Error while reading request (gossip): The I/O operation has been aborted because of either a thread exit or an application request

[PID:06176:012 2017.12.12 10:09:12.533 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 330250 records (41.1%).

[PID:06176:006 2017.12.12 10:09:12.611 DEBUG HttpEntityManager ] Error during setting content length on HTTP response: This operation cannot be performed after the response has been submitted…

[PID:06176:004 2017.12.12 10:09:13.361 DEBUG HttpEntityManager ] Close connection error (after crash in read request): The parameter is incorrect

[PID:06176:004 2017.12.12 10:09:13.361 DEBUG GossipController ] Error while reading request (gossip): The I/O operation has been aborted because of either a thread exit or an application request

[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] CLUSTER HAS CHANGED (gossip received from [10.23.64.18:2113])

[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] Old:

[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] VND {6f92cfca-9310-43c9-9928-dd5401f8371b} [Slave, 10.23.64.18:1113, n/a, 10.23.64.18:1112, n/a, 10.23.64.18:2113, 10.23.64.18:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:10.349

[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] VND {2ad01055-f180-46dd-9ac3-2db34524ba62} [Master, 10.23.64.17:1113, n/a, 10.23.64.17:1112, n/a, 10.23.64.17:2113, 10.23.64.17:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:12.127

[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] VND {9ba44d5b-f21f-4d42-87d7-87e994e5d689} [Initializing, 10.23.64.15:1113, 10.23.64.15:0, 10.23.64.15:1112, 10.23.64.15:0, 10.23.64.15:2113, 10.23.64.15:2114] 4103601075/9975646901/9975646901/E10839@9975596515:{1094ed09-6399-40e9-a92c-017322f6d1d8} | 2017-12-12 10:09:13.611

[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] New:

[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] VND {6f92cfca-9310-43c9-9928-dd5401f8371b} [Slave, 10.23.64.18:1113, n/a, 10.23.64.18:1112, n/a, 10.23.64.18:2113, 10.23.64.18:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:13.589

[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] VND {2ad01055-f180-46dd-9ac3-2db34524ba62} [Master, 10.23.64.17:1113, n/a, 10.23.64.17:1112, n/a, 10.23.64.17:2113, 10.23.64.17:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:13.589

[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] VND {9ba44d5b-f21f-4d42-87d7-87e994e5d689} [Initializing, 10.23.64.15:1113, 10.23.64.15:0, 10.23.64.15:1112, 10.23.64.15:0, 10.23.64.15:2113, 10.23.64.15:2114] 4103601075/9975646901/9975646901/E10839@9975596515:{1094ed09-6399-40e9-a92c-017322f6d1d8} | 2017-12-12 10:09:13.611

[PID:06176:011 2017.12.12 10:09:13.611 TRACE GossipServiceBase ] --------------------------------------------------------------------------------

[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] Looks like node [10.23.64.17:2113] is DEAD (Gossip send failed).

[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] CLUSTER HAS CHANGED (gossip send failed to [10.23.64.17:2113])

[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] Old:

[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] VND {6f92cfca-9310-43c9-9928-dd5401f8371b} <LIVE> [Slave, 10.23.64.18:1113, n/a, 10.23.64.18:1112, n/a, 10.23.64.18:2113, 10.23.64.18:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:14.381

[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] VND {2ad01055-f180-46dd-9ac3-2db34524ba62} <LIVE> [Master, 10.23.64.17:1113, n/a, 10.23.64.17:1112, n/a, 10.23.64.17:2113, 10.23.64.17:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:13.589

[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] VND {9ba44d5b-f21f-4d42-87d7-87e994e5d689} <LIVE> [Initializing, 10.23.64.15:1113, 10.23.64.15:0, 10.23.64.15:1112, 10.23.64.15:0, 10.23.64.15:2113, 10.23.64.15:2114] 4107207404/9975646901/9975646901/E10839@9975596515:{1094ed09-6399-40e9-a92c-017322f6d1d8} | 2017-12-12 10:09:14.621

[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] New:

[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] VND {6f92cfca-9310-43c9-9928-dd5401f8371b} <LIVE> [Slave, 10.23.64.18:1113, n/a, 10.23.64.18:1112, n/a, 10.23.64.18:2113, 10.23.64.18:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:14.381

[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] VND {2ad01055-f180-46dd-9ac3-2db34524ba62} <DEAD> [Master, 10.23.64.17:1113, n/a, 10.23.64.17:1112, n/a, 10.23.64.17:2113, 10.23.64.17:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:15.136

[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] VND {9ba44d5b-f21f-4d42-87d7-87e994e5d689} <LIVE> [Initializing, 10.23.64.15:1113, 10.23.64.15:0, 10.23.64.15:1112, 10.23.64.15:0, 10.23.64.15:2113, 10.23.64.15:2114] 4107207404/9975646901/9975646901/E10839@9975596515:{1094ed09-6399-40e9-a92c-017322f6d1d8} | 2017-12-12 10:09:14.621

[PID:06176:011 2017.12.12 10:09:15.136 TRACE GossipServiceBase ] --------------------------------------------------------------------------------

[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] CLUSTER HAS CHANGED (gossip received from [10.23.64.18:2113])

[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] Old:

[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] VND {6f92cfca-9310-43c9-9928-dd5401f8371b} <LIVE> [Slave, 10.23.64.18:1113, n/a, 10.23.64.18:1112, n/a, 10.23.64.18:2113, 10.23.64.18:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:15.393

[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] VND {2ad01055-f180-46dd-9ac3-2db34524ba62} <DEAD> [Master, 10.23.64.17:1113, n/a, 10.23.64.17:1112, n/a, 10.23.64.17:2113, 10.23.64.17:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:15.136

[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] VND {9ba44d5b-f21f-4d42-87d7-87e994e5d689} <LIVE> [Initializing, 10.23.64.15:1113, 10.23.64.15:0, 10.23.64.15:1112, 10.23.64.15:0, 10.23.64.15:2113, 10.23.64.15:2114] 4110597233/9975646901/9975646901/E10839@9975596515:{1094ed09-6399-40e9-a92c-017322f6d1d8} | 2017-12-12 10:09:15.621

[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] New:

[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] VND {6f92cfca-9310-43c9-9928-dd5401f8371b} <LIVE> [Slave, 10.23.64.18:1113, n/a, 10.23.64.18:1112, n/a, 10.23.64.18:2113, 10.23.64.18:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:15.612

[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] VND {2ad01055-f180-46dd-9ac3-2db34524ba62} <LIVE> [Master, 10.23.64.17:1113, n/a, 10.23.64.17:1112, n/a, 10.23.64.17:2113, 10.23.64.17:2114] 9979866069/9979883854/9979883854/E10848@9979865831:{08f73fb1-d4aa-4294-bac5-dc12a0771dc5} | 2017-12-12 10:09:15.393

[PID:06176:011 2017.12.12 10:09:15.621 TRACE GossipServiceBase ] VND {9ba44d5b-f21f-4d42-87d7-87e994e5d689} <LIVE> [Initializing, 10.23.64.15:1113, 10.23.64.15:0, 10.23.64.15:1112, 10.23.64.15:0, 10.23.64.15:2113, 10.23.64.15:2114] 4110597233/9975646901/9975646901/E10839@9975596515:{1094ed09-6399-40e9-a92c-017322f6d1d8} | 2017-12-12 10:09:15.621
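To make the gossip dumps above easier to follow, here is a minimal Python sketch that pulls the state, the LIVE/DEAD flag and the leading position numbers out of a VND line. The helper name `parse_member` is made up for illustration, and the meaning of the three numbers is an assumption on my part: I am treating the first as how far the node has got and the second as the position it needs to reach. That reading at least matches the "ReadIndex Rebuilding ... (41.1%)" line for the rebooted node.

```python
import re

# Matches the "VND" member lines printed by GossipServiceBase in the log above.
# The <LIVE>/<DEAD> flag is optional because the first dump in the log omits it.
VND_RE = re.compile(
    r"VND \{(?P<id>[0-9a-f-]+)\}\s+(?:<(?P<alive>LIVE|DEAD)>\s+)?"
    r"\[(?P<state>\w+), (?P<endpoints>[^\]]*)\]\s+"
    r"(?P<first>\d+)/(?P<second>\d+)/(?P<third>\d+)/"
)


def parse_member(line):
    """Return a dict describing one VND line, or None if the line is not one."""
    m = VND_RE.search(line)
    if m is None:
        return None
    endpoints = [e.strip() for e in m.group("endpoints").split(",")]
    first = int(m.group("first"))
    second = int(m.group("second"))
    return {
        "id": m.group("id"),
        "alive": m.group("alive"),      # "LIVE", "DEAD", or None if absent
        "state": m.group("state"),      # e.g. "Master", "Slave", "Initializing"
        "address": endpoints[0],        # first endpoint listed in the brackets
        # Assumption: first/second approximates how far the node has caught up.
        "progress": first / second if second else None,
    }


if __name__ == "__main__":
    sample = (
        "VND {9ba44d5b-f21f-4d42-87d7-87e994e5d689} <LIVE> [Initializing, "
        "10.23.64.15:1113, 10.23.64.15:0, 10.23.64.15:1112, 10.23.64.15:0, "
        "10.23.64.15:2113, 10.23.64.15:2114] 4107207404/9975646901/9975646901/"
        "E10839@9975596515:{1094ed09-6399-40e9-a92c-017322f6d1d8} | "
        "2017-12-12 10:09:14.621"
    )
    member = parse_member(sample)
    print(member["address"], member["state"], member["alive"],
          "{:.1%}".format(member["progress"]))
```

Run over the whole log, this shows 10.23.64.15 staying in Initializing while its first number slowly climbs, and 10.23.64.17 flipping between LIVE and DEAD.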

Any idea how to bring this node back online? What’s causing the underlying issue? What is causing the status to cycle between LIVE and DEAD?

Thanks

Hi,

It’s back online. It took almost an hour for it to settle down again. I’m not sure where the problem lay, whether in gES or in the underlying Azure infrastructure supporting the IaaS VMs.

Is there a recommended way to reboot nodes in a cluster? How can I choose which node is MASTER, so that I can make sure I’m not rebooting the current one?
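For context, this is roughly how I would expect to check which node is currently master before a reboot: a minimal Python sketch, not something I have confirmed against the docs. It assumes each node serves its gossip view as JSON at /gossip on the HTTP port seen in the logs (2113), and that the document has a `members` array whose entries carry `state`, `isAlive` and `externalHttpIp` fields. Those field names and the helper names `fetch_gossip` and `find_master` are assumptions to be checked against a real response.

```python
import json
from urllib.request import urlopen


def fetch_gossip(host, port=2113, timeout=5):
    """Fetch the gossip document from one node (assumed URL: /gossip)."""
    with urlopen("http://{}:{}/gossip".format(host, port), timeout=timeout) as resp:
        return json.load(resp)


def find_master(gossip):
    """Return the first member reported as a live Master, or None."""
    for member in gossip.get("members", []):
        # Field names below are assumptions about the gossip JSON layout.
        if member.get("state") == "Master" and member.get("isAlive"):
            return member
    return None


if __name__ == "__main__":
    # Ask one of the nodes that is not being rebooted for its view of the cluster.
    master = find_master(fetch_gossip("10.23.64.18"))
    if master is None:
        print("no live master reported")
    else:
        print("current master:", master.get("externalHttpIp"))
```

The idea would be to reboot the non-master nodes one at a time, waiting for each to leave Initializing before touching the next, and to leave the node reported here until last.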

Thanks