cluster unstable (VERY SLOW QUEUE MSG [StorageWriterQueue])

Hello,

In April and May 2017, our cluster has been throwing errors such as "VERY SLOW QUEUE MSG" and "CommitTimeout". These used to happen only occasionally, but they have become frequent since April.

Our cluster has 3 nodes, all Rackspace virtual machines. Each machine's environment is:

  • Windows version: Windows Server 2008 R2
  • CLR Version: 4
  • Framework: 4.6.1
  • EventStore version: 3.5.0 OSS

Below is part of our error log:

[PID:03076:006 2017.05.04 08:15:47.498 DEBUG HttpSendService ] Error occurred while replying to HTTP with message EventStore.Projections.Core.Messages.ProjectionManagementMessage+Statistics: The specified network name is no longer available.
[PID:03076:011 2017.05.04 08:13:39.599 ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 50094ms. Q: 0/22.
[PID:03076:004 2017.05.04 08:09:12.871 INFO CoreProjectionCheckp] Failed to write events to stream $ce-xxx. Error: CommitTimeout

We use HTTPS on each server, and each server has more than 800 chunk files, taking up almost 210 GB of disk space.

From Resource Monitor, disk and network usage are both low: the disk keeps a 1-2 MB/s write rate, and the network is at less than 10% usage.

So could you please help me look into this error?
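
(For what it's worth, a rough way to watch the queue behaviour alongside these errors is to poll a node's /stats endpoint and log everything queue-related, so the StorageWriterQueue length and timings can be correlated with the "VERY SLOW QUEUE MSG" entries. The sketch below is not from this thread; the node address, the port, and the availability of /stats as JSON on the external HTTP interface are assumptions based on Event Store 3.x defaults.)

# Diagnostic sketch (assumptions noted above): poll /stats and print queue-related values.
import json
import time
import urllib.request

NODE = "http://172.30.1.193:2114"  # external HTTP endpoint of one node (assumed)

def flatten(obj, path=""):
    """Yield (path, value) pairs for every leaf in a nested JSON document."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{path}/{key}")
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from flatten(value, f"{path}/{i}")
    else:
        yield path, obj

while True:
    with urllib.request.urlopen(f"{NODE}/stats", timeout=10) as response:
        stats = json.load(response)
    for path, value in flatten(stats):
        if "queue" in path.lower():  # e.g. storage writer queue length and timings
            print(path, "=", value)
    print("-" * 40)
    time.sleep(30)  # sample every 30 seconds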

On another note, to avoid the replication delay between the master and the slaves, we configured our application to both write to and read from the master.

On the master, EventStore.ClusterNode.exe has loaded lots of chunk files at the same time, with just one chunk open for writing and the others open for reading. That matches how our application works, but I don't know whether this is what triggers the "VERY SLOW QUEUE MSG" error above.

Here is a screenshot of the master server's disk monitor:

Thanks for your help.

Chaohui Zhang

On another note, to avoid the replication delay between the master and the slaves, we configured our application to both write to and read from the master.
On the master, EventStore.ClusterNode.exe has loaded lots of chunk files at the same time, with just one chunk open for writing and the others open for reading. That matches how our application works, but I don't know whether this is what triggers the "VERY SLOW QUEUE MSG" error above.

Can you clarify what you mean here?

It means: after one application raises an event, other consumers want to use that event as soon as possible. Previously we used a load balancer to redirect requests to Event Store, but to keep event order we now write events only to the master, because syncing an event from node A to node B takes some time. If the created-event is sent to node A but the updated-event is sent to node B, then at that moment node B has not created the stream yet and throws an exception.
We also read from the master for the same reason: the created-event is read from the master, and the consumer receives a link to the updated-event, but if a slave node has not finished syncing the updated-event and the consumer fetches that updated-event from the slave, it gets a 404 error. So we read from the master as well, and the two slave nodes just act as backups.
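
(For illustration, a minimal sketch of this write/read-to-the-master-only approach over the HTTP API might look like the code below. The master address, the stream name, and the use of the ES-RequireMaster header, which as far as I recall makes a non-master node reject or redirect the write instead of quietly forwarding it, are assumptions about the 3.x API, not details confirmed in this thread.)

# Minimal sketch (assumptions noted above): append to and read from the master only.
import json
import uuid
import urllib.request

MASTER = "http://172.30.1.193:2114"  # external HTTP endpoint of the current master (assumed)

def append_event(stream, event_type, data):
    """POST a single event to a stream on the master."""
    body = json.dumps([{
        "eventId": str(uuid.uuid4()),
        "eventType": event_type,
        "data": data,
    }]).encode("utf-8")
    request = urllib.request.Request(
        f"{MASTER}/streams/{stream}",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/vnd.eventstore.events+json",
            # Assumed 3.x behaviour: do not let a slave forward this write, so the
            # client notices immediately if it is not talking to the master.
            "ES-RequireMaster": "True",
        },
    )
    with urllib.request.urlopen(request) as response:
        return response.status  # 201 Created on success

def read_event(stream, event_number):
    """GET a single event from the master, avoiding a possibly-lagging slave."""
    request = urllib.request.Request(
        f"{MASTER}/streams/{stream}/{event_number}",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Example: the consumer reads the updated-event from the master, so it cannot
# hit a 404 caused by a slave that has not finished replicating it yet.
# append_event("order-123", "OrderCreated", {"total": 10})
# print(read_event("order-123", 0))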

Is that clear for you?


Thank you for helping us.

What is your configuration? Also, there was an upgrade dealing with indexes and what caching depth should be used, though I forget exactly which version it came out in. With such a large db this may be involved.

We use a Windows service as a wrapper to start and stop EventStore.ClusterNode.exe, and pass the parameters to it through the wrapper. The configuration is below; everything else uses the default settings.

<instance name="eventstore"
          add-interface-prefixes="false"
          int-http-prefixes="http://172.30.1.193:2113/"
          ext-http-prefixes="http://172.30.1.193:2114/,http://172.30.1.193:2113/"
          ext-ip="127.0.0.1"
          int-ip="127.0.0.1"
          int-tcp-port="1112"
          ext-tcp-port="1113"
          int-http-port="2113"
          ext-http-port="2114"
          db="D:\eventstore-db-2"
          log=".\eslogs"
          use-internal-ssl="false"
          ssl-target-host="xxx.com"
          ssl-validate-server="false"
          run-projections="all"
          certificate-store-location="LocalMachine"
          certificate-store-name="My"
          certificate-thumbprint="253tgrasd....."
          ext-secure-tcp-port="2115"
          int-secure-tcp-port="2116"
          cluster-size="3"
          discover-via-dns="false"
          gossip-seed="172.30.1.194:2113,172.30.1.195:2113" />

Thank you very much.

Can you try updating a node to 3.9.4 and let us know if that node still sees the issue?

So you mean just upgrade one node to 3.9.4 and keep the other two nodes on 3.5.0? Can they work fine together in one cluster?

Thank you very much!

Yes, the nodes can work side by side. My guess is there is a fix somewhere around 3.7, dealing with large dbs (index cache sizing), that could be related to the cause of your issue.

You can also do this without downtime; just look at which node is the master and update one of the other two.
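
(For reference, one way to see which node is currently the master is to query the /gossip endpoint on any node, which returns the cluster members and their states as JSON. The sketch below is mine, not from the thread; the node addresses, the port, and the field names such as "members", "state", "isAlive", and "externalHttpIp" are assumptions from the 3.x gossip output and may need adjusting.)

# Rough sketch: ask each node's /gossip endpoint which member is the master.
import json
import urllib.request

NODES = ["172.30.1.193", "172.30.1.194", "172.30.1.195"]  # the three cluster nodes (assumed)
PORT = 2113  # HTTP port serving gossip (assumed)

def find_master():
    for node in NODES:
        try:
            with urllib.request.urlopen(f"http://{node}:{PORT}/gossip", timeout=5) as response:
                gossip = json.load(response)
        except OSError:
            continue  # this node is unreachable; ask the next one
        for member in gossip.get("members", []):
            if member.get("state") == "Master" and member.get("isAlive"):
                return member.get("externalHttpIp")
    return None

if __name__ == "__main__":
    print("Current master:", find_master())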

I know the cluster has rules for switching master and slave roles, so if we upgrade one node to 3.9.4 and that node becomes master, will it still work fine?

Thank you very much.

Yes, we use semantic versioning, thus any 3.x.x versions can run next to each other.

We upgraded one node to 3.9.4. This node is not the master, but we found the logs below on it:

[PID:01484:009 2017.05.10 06:05:16.444 TRACE ClusterStorageWriter] Completing data chunk 76-76…

[PID:01484:009 2017.05.10 06:05:16.476 TRACE TFChunk ] CACHED TFChunk #77-77 (chunk-000077.000000) in 00:00:00.0047554.

[PID:01484:027 2017.05.10 06:05:16.538 TRACE TFChunk ] UNCACHED TFChunk #75-75 (chunk-000075.000000).

[PID:01484:013 2017.05.10 06:05:36.765 TRACE TableIndex ] Switching MemTable, currently: 6 awaiting tables.

[PID:01484:009 2017.05.10 06:07:24.725 TRACE ClusterStorageWriter] Completing data chunk 77-77…

[PID:01484:009 2017.05.10 06:07:24.756 TRACE TFChunk ] CACHED TFChunk #78-78 (chunk-000078.000000) in 00:00:00.0031866.

[PID:01484:021 2017.05.10 06:07:24.803 TRACE TFChunk ] UNCACHED TFChunk #76-76 (chunk-000076.000000).

[PID:01484:012 2017.05.10 06:08:19.518 TRACE InMemoryBus ] SLOW BUS MSG [MainBus]: DataChunkBulk - 51ms. Handler: WideningHandler`2.

[PID:01484:012 2017.05.10 06:08:19.518 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [MainQueue]: DataChunkBulk - 53ms. Q: 5/37.

[PID:01484:009 2017.05.10 06:09:25.074 TRACE ClusterStorageWriter] Completing data chunk 78-78…

[PID:01484:009 2017.05.10 06:09:25.090 TRACE TFChunk ] CACHED TFChunk #79-79 (chunk-000079.000000) in 00:00:00.0025078.

[PID:01484:021 2017.05.10 06:09:25.168 TRACE TFChunk ] UNCACHED TFChunk #77-77 (chunk-000077.000000).

[PID:01484:012 2017.05.10 06:10:37.836 TRACE GossipServiceBase ] Looks like master [172.30.1.194:2113, {4f535c89-00fb-4f65-b9d7-dc4459d68548}] is DEAD (Gossip send failed), though we wait for TCP to decide.

See the last line: the master machine was actually working fine.
So what does this log message mean?

Thank you very much!

The log is saying that a gossip send failed. When this happens we say "let's wait for TCP to be the authority on whether the master is still alive, since we are connected to it." The reason for this is that gossip can fail transiently for a number of reasons, whereas TCP tends to be more reliable; forcing an election whenever gossip fails would cause the cluster to keep shifting.
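
(As a toy illustration of that decision, not Event Store's actual code, the rule described above could be written as:)

# Toy model: a failed gossip send alone does not mark the master dead
# while the TCP connection to it is still alive.
def master_considered_dead(gossip_send_failed: bool, tcp_connection_alive: bool) -> bool:
    if gossip_send_failed and tcp_connection_alive:
        # Gossip can fail transiently (timeouts, brief network blips),
        # so defer to TCP rather than forcing an election.
        return False
    return gossip_send_failed and not tcp_connection_alive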

But the master and the cluster look fine from the web UI.

See these images.

Thank you very much!

A gossip failure is usually a transient failure; gossip runs more than once a second by default to check whether the other nodes are alive.

Hello,

We also found the error messages below in the cluster:

Machine          Event
172.20.14.100    [PID:07020:019 2017.05.12 06:10:47.210 DEBUG HttpEntityManager ] Error copying forwarded response stream for 'https://172.20.14.100:2114/projections/all-non-transient': The specified network name is no longer available.
172.20.14.101    [PID:139260:021 2017.05.12 06:10:47.229 DEBUG HttpEntityManager ] Error copying forwarded response stream for 'https://172.20.14.101:2114/projections/all-non-transient': The specified network name is no longer available.
172.20.14.101    [PID:139260:020 2017.05.12 06:30:47.847 DEBUG HttpEntityManager ] Error copying forwarded response stream for 'https://172.20.14.101:2114/projections/all-non-transient': The specified network name is no longer available.
172.20.14.101    [PID:139260:021 2017.05.12 06:50:47.692 DEBUG HttpEntityManager ] Error copying forwarded response stream for 'https://172.20.14.101:2114/projections/all-non-transient': The specified network name is no longer available.
172.20.14.102    [PID:03076:005 2017.05.12 07:00:47.656 DEBUG HttpSendService ] Error occurred while replying to HTTP with message EventStore.Projections.Core.Messages.ProjectionManagementMessage+Statistics: The specified network name is no longer available.
172.20.14.101    [PID:139260:038 2017.05.12 07:25:47.697 DEBUG HttpEntityManager ] Error copying forwarded response stream for 'https://172.20.14.101:2114/projections/all-non-transient': The specified network name is no longer available.
172.20.14.101    [PID:139260:038 2017.05.12 07:30:48.241 DEBUG HttpEntityManager ] Error copying forwarded response stream for 'https://172.20.14.101:2114/projections/all-non-transient': The specified network name is no longer available.
172.20.14.102    [PID:03076:009 2017.05.12 07:35:47.539 DEBUG HttpSendService ] Error occurred while replying to HTTP with message EventStore.Projections.Core.Messages.ProjectionManagementMessage+Statistics: The specified network name is no longer available.
172.20.14.101    [PID:139260:034 2017.05.12 07:40:47.924 DEBUG HttpEntityManager ] Error copying forwarded response stream for 'https://172.20.14.101:2114/projections/all-non-transient': The specified network name is no longer available.
172.20.14.102    [PID:03076:025 2017.05.12 08:05:47.959 DEBUG HttpSendService ] Error occurred while replying to HTTP with message EventStore.Projections.Core.Messages.ProjectionManagementMessage+Statistics: The specified network name is no longer available.
172.20.14.101    [PID:139260:004 2017.05.12 08:25:47.983 DEBUG HttpEntityManager ] Error copying forwarded response stream for 'https://172.20.14.101:2114/projections/all-non-transient': The specified network name is no longer available.
172.20.14.102    [PID:03076:004 2017.05.12 08:55:13.216 INFO CoreProjectionCheckp] Failed to write events to stream $ce-xxx. Error: CommitTimeout
172.20.14.102    [PID:03076:004 2017.05.12 08:55:13.216 INFO CoreProjectionCheckp] Failed to write events to stream $projections-$ce-xxx-order. Error: CommitTimeout
172.20.14.102    [PID:03076:011 2017.05.12 08:56:25.729 ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 75092ms. Q: 1/159.

What do these errors mean? After the "VERY SLOW" message happened, we could not connect to the cluster.

Thanks a lot!