Something is seriously wrong in chunk #25-25

I have set up a new GES (3.0.0 RC2) deployment and, besides the original system projections, created 5 custom projections that emit new events. Then I proceeded to fill the database with about 9 GB of data. Somewhere in the middle of this, the dispatcher, which uses EventStore.ClientAPI, could not proceed any further and failed with the following error:

Unexpected error on server: Log record at actual pos 96516326 has too large length: 812016225 bytes, while limit is 16777216 bytes. Something is seriously wrong in chunk #25-25 (chunk-000025.000000).

I tried to increase the position to skip the chunk entirely, but then the same error occurred with chunk #26-26. The issue got temporarily resolved after increasing the position to the last checkpoint. However, due to this issue some events could not be read. SkipDbVerify is false and verification of those chunks passes successfully. I have saved the logs, but I could not track down how the data got corrupted because I don’t know the exact time when it happened.

I had already had this problem with a fresh EventStore deployment before, which is why I decided to redeploy it. Previously, I had the same issue after creating a new projection on the pre-filled database.

The 3 EventStore nodes run on virtual machines, and Windows write caching cannot be disabled on the hard drives (probably because they are virtual as well). However, the architectural overview in the wiki states that EventStore uses fsync/FlushFileBuffers to ensure data persistence.

What could be the reason for these frequent corruptions and how to prevent them in the future?

“I tried to increase the position to skip the chunk entirely, but then the same error occurred with chunk #26-26. The issue got temporarily resolved after increasing the position to the last checkpoint.”

What do you mean by “changed the position”? What process were you going through? Can you go into a bit more detail?

“The 3 EventStore nodes run on virtual machines, and Windows write caching cannot be disabled on the hard drives (probably because they are virtual as well). However, the architectural overview in the wiki states that EventStore uses fsync/FlushFileBuffers to ensure data persistence.”

Sure, we use fsync, but fsync doesn’t always guarantee durability if things like caching are enabled. Windows write caching on a local drive, for instance, enables controller caching as well (so the controller will cache writes even if it’s not safe to do so).
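
To illustrate, here is a minimal .NET sketch (not EventStore’s actual writer code; the path and record are made up). Flush(true) maps to FlushFileBuffers, but if the drive or its controller acknowledges the write while holding it only in a volatile cache, the data can still be lost:

using System.IO;
using System.Text;

class FlushSketch
{
    static void Main()
    {
        // Hypothetical file path, for illustration only.
        using (var fs = new FileStream(@"d:\Store\example.chunk",
                                       FileMode.Append, FileAccess.Write, FileShare.Read))
        {
            byte[] record = Encoding.UTF8.GetBytes("some log record");
            fs.Write(record, 0, record.Length);

            // Flush(true) asks the OS to call FlushFileBuffers, i.e. push the data
            // through the OS file cache to the device. If the (virtual) disk or its
            // controller caches the write anyway, durability is not guaranteed.
            fs.Flush(true);
        }
    }
}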

We have a dispatcher which subscribes to all streams and keeps track of the current position in the transaction file. Calling the following code on the EventStore.ClientAPI.IEventStoreConnection (_client) causes the exception with the error mentioned in the previous post:

Position position = _lastProcessed;
AllEventsSlice slice = _client.ReadAllEventsForward(position, 500, false);

The _lastProcessed position is always persisted to a file so that we can come back to it later and process all events. And since chunks #25-25 and #26-26 were corrupted, I had to skip those positions entirely and jump to the last position, from which events could be read successfully. So it looks like only those few chunks in the middle got corrupted. Currently, the last chunk is chunk-000035.000000.
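
Roughly, the dispatcher loop looks like this (a simplified sketch rather than our real implementation; the checkpoint file name and the Dispatch method are placeholders):

using System.IO;
using EventStore.ClientAPI;

class EventDispatcher
{
    // Placeholder checkpoint file name.
    private const string CheckpointFile = "dispatcher.chk";
    private readonly IEventStoreConnection _client;
    private Position _lastProcessed = Position.Start;

    public EventDispatcher(IEventStoreConnection client)
    {
        _client = client;
        if (File.Exists(CheckpointFile))
        {
            // Restore the last persisted commit/prepare positions.
            var parts = File.ReadAllText(CheckpointFile).Split('/');
            _lastProcessed = new Position(long.Parse(parts[0]), long.Parse(parts[1]));
        }
    }

    public void Poll()
    {
        Position position = _lastProcessed;
        AllEventsSlice slice = _client.ReadAllEventsForward(position, 500, false);

        foreach (ResolvedEvent evt in slice.Events)
            Dispatch(evt);                      // hand the event to the read side

        // Persist the new position so we can resume from here after a restart.
        _lastProcessed = slice.NextPosition;
        File.WriteAllText(CheckpointFile,
            _lastProcessed.CommitPosition + "/" + _lastProcessed.PreparePosition);
    }

    private void Dispatch(ResolvedEvent evt)
    {
        // Project the event into the read model (omitted).
    }
}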

Is this a clustered system? What happens if you try connecting to another node? Did you have a power outage (a bad shutdown of a VM, etc.)?

Could you explain a bit more about your setup as well?

There are 3 EventStore nodes running on different VMs in a cluster.
There are 2 writers that write events through the cluster DNS. They append events directly to streams using WriteResult EventStore.ClientAPI.IEventStoreConnection.AppendToStream(string stream, int expectedVersion, IEnumerable<EventData> events, UserCredentials userCredentials = null). A sketch of such a call follows after this list.

A single EventDispatcher connects to a cluster DNS and subscribes to all streams.
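
Here is the sketch of such a writer call (simplified; the stream name, event type and payload are placeholders, and connection setup and error handling are omitted):

using System;
using System.Text;
using EventStore.ClientAPI;

static class WriterSketch
{
    // Assumes an IEventStoreConnection already connected through the cluster DNS.
    public static void AppendExample(IEventStoreConnection connection)
    {
        var eventData = new EventData(
            Guid.NewGuid(),                               // event id
            "BannerUpdated",                              // placeholder event type
            true,                                         // payload is JSON
            Encoding.UTF8.GetBytes("{\"bannerId\": 42}"), // data
            null);                                        // metadata

        WriteResult result = connection.AppendToStream(
            "banner-42",                                  // placeholder stream name
            ExpectedVersion.Any,                          // no optimistic concurrency check here
            new[] { eventData });
    }
}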

I left the event writing running through the night, and I can see that 2 nodes crashed. The first node crashed on chunk 23, so I guess that did not cause the corruption, because the dispatcher read past that point successfully.

NODE 1:

[PID:04048:012 2014.07.01 23:26:14.826 INFO ClusterVNodeControll] ========== [172.22.3.75:2113] IS CATCHING UP!!! BANZAI!!! MASTER IS [172.22.3.76:2113,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}]

[PID:04048:012 2014.07.01 23:26:14.842 INFO ClusterVNodeControll] ========== [172.22.3.75:2113] CLONE ASSIGNMENT RECEIVED FROM [172.22.3.76:1111,n/a,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}].

[PID:04048:012 2014.07.01 23:26:14.842 INFO ClusterVNodeControll] ========== [172.22.3.75:2113] IS CLONE!!! SPARTA!!! MASTER IS [172.22.3.76:2113,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}]

[PID:04048:012 2014.07.01 23:26:14.842 INFO ClusterVNodeControll] ========== [172.22.3.75:2113] SLAVE ASSIGNMENT RECEIVED FROM [172.22.3.76:1111,n/a,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}].

[PID:04048:012 2014.07.01 23:26:14.842 INFO ClusterVNodeControll] ========== [172.22.3.75:2113] IS SLAVE!!! SPARTA!!! MASTER IS [172.22.3.76:2113,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}]

[PID:04048:009 2014.07.01 23:26:14.888 INFO ClusterStorageWriter] === SUBSCRIBED to [172.22.3.76:1111,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}] at 6432480148 (0x17F67DB94). SubscriptionId: {070a1305-9d61-4c00-b64f-b25d3864fb78}.

[PID:04048:009 2014.07.01 23:26:14.888 INFO ClusterStorageWriter] Master [172.22.3.76:1111,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}] subscribed us at 6432480148 (0x17F67DB94), which is less than our writer checkpoint 6433545594 (0x17F781D7A). TRUNCATION IS NEEDED!

[PID:04048:009 2014.07.01 23:26:14.888 INFO ClusterStorageWriter] OFFLINE TRUNCATION IS NEEDED (SubscribedAt 6432480148 (0x17F67DB94) <= LastCommitPosition 6433543849 (0x17F7816A9)). SHUTTING DOWN NODE.

[PID:04048:012 2014.07.01 23:26:14.888 INFO ClusterVNodeControll] ========== [172.22.3.75:2113] IS SHUTTING DOWN!!! FAREWELL, WORLD…

[PID:04048:009 2014.07.01 23:26:14.919 TRACE StorageWriterService] Blocking message DataChunkBulk in StorageWriterService. Message:

DataChunkBulk message: MasterId: 3d6b04ec-7c96-48d3-a8bd-27c181a16990, SubscriptionId: 070a1305-9d61-4c00-b64f-b25d3864fb78, ChunkStartNumber: 23, ChunkEndNumber: 23, SubscriptionPosition: 6432480148, DataBytes length: 782, CompleteChunk: False

At that point NODE 2 was, or became, the master. However, something weird later happened on NODE 3:

[PID:05348:029 2014.07.02 01:01:09.514 TRACE TFChunk ] UNCACHED TFChunk #30-30 (chunk-000030.000000).

[PID:05348:036 2014.07.02 01:19:49.122 TRACE InMemoryBus ] SLOW BUS MSG [Worker #5 Bus]: HttpSend - 312ms. Handler: HttpSendService.

[PID:05348:036 2014.07.02 01:19:49.122 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [Worker #5]: HttpSend - 312ms. Q: 0/1.

[PID:01440:001 2014.07.02 05:23:44.544 INFO ProgramBase`1 ]

ES VERSION: 3.0.0.0 (master/30f7fa64d73ba65028e5a6ca720639985cac1458, Tue, 18 Mar 2014 20:03:41 +0000)

OS: Windows (Microsoft Windows NT 6.2.9200.0)

RUNTIME: .NET 4.0.30319.18046 (64-bit)

GC: 3 GENERATIONS

LOGS: d:\Store\logs\cluster

SHOW HELP: False ()

SHOW VERSION: False ()

LOGS DIR: d:\Store\logs\cluster (logsdir in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

CONFIGS: C:\Windows\TEMP\tmp5D07.tmp (--config from command line)

DEFINES: ()

INTERNAL IP: 172.22.3.77 (internalIp in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

EXTERNAL IP: 172.22.3.77 (externalIp in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

INTERNAL HTTP PORT: 2113 (internalHttpPort in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

EXTERNAL HTTP PORT: 2114 (externalHttpPort in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

INTERNAL TCP PORT: 1111 (internalTcpPort in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

INTERNAL SECURE TCP PORT: 0 ()

EXTERNAL TCP PORT: 1112 (externalTcpPort in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

EXTERNAL SECURE TCP PORT: 0 ()

FORCE: False ()

CLUSTER SIZE: 3 (clusterSize in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

MIN FLUSH DELAY MS: 2 ()

NODE PRIORITY: 0 ()

COMMIT COUNT: 2 ()

PREPARE COUNT: 2 ()

MAX MEM TABLE SIZE: 1000000 ()

DISCOVER VIA DNS: True (useDnsDiscovery in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

CLUSTER DNS: eventstore.dev3.adform.com (clusterDns in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

CLUSTER GOSSIP PORT: 2113 (clusterGossipPort in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

GOSSIP SEEDS: ()

STATS PERIOD SEC: 30 ()

CACHED CHUNKS: -1 ()

CHUNKS CACHE SIZE: 536871424 ()

DB PATH: d:\Store\Banners (db in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

IN MEM DB: False (memDb in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

SKIP DB VERIFY: False ()

RUN PROJECTIONS: All (runProjections in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)

PROJECTION THREADS: 3 ()

WORKER THREADS: 5 ()

HTTP PREFIXES: ()

ENABLE TRUSTED AUTH: False ()

CERTIFICATE STORE: ()

CERTIFICATE NAME: ()

CERTIFICATE FILE: ()

CERTIFICATE PASSWORD: ()

USE INTERNAL SSL: False ()

SSL TARGET HOST: n/a ()

SSL VALIDATE SERVER: True ()

AUTHENTICATION TYPE: internal ()

AUTHENTICATION CONFIG FILE: ()

PREPARE TIMEOUT MS: 2000 ()

COMMIT TIMEOUT MS: 2000 ()

DISABLE SCAVENGE MERGING: False ()

GOSSIP ON EXT: True ()

STATS ON EXT: True ()

ADMIN ON EXT: True ()

[PID:01440:001 2014.07.02 05:23:44.856 INFO ProgramBase`1 ] Quorum size set to 2

It looks like it crashed and then restarted, but there are no errors or events in the Windows Event Log, so I don’t think there was any system failure.

I don’t see any weird behavior while it was working with chunks #25-25 or #26-26, though.

[PID:05348:036 2014.07.02 01:19:49.122 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [Worker #5]: HttpSend - 312ms. Q: 0/1.

[PID:01440:001 2014.07.02 05:23:44.544 INFO ProgramBase`1 ]

What happened in the 4 hours in between? Is something monitoring the process to start it back up again?

None that I could find in our code. EventStore is started under a Topshelf service, and the service keeps running even after the EventStore node is dead; it does not detect the failure or try to restart the node (we do that manually). So, in that instance it looks like the node somehow restarted by itself.
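
For completeness, a supervising Topshelf host would look roughly like the sketch below. This is not our current service; the executable path, arguments and service name are placeholders:

using System.Diagnostics;
using Topshelf;

public class EventStoreNodeHost
{
    private Process _node;
    private volatile bool _stopping;

    public void Start()
    {
        _node = new Process
        {
            StartInfo = new ProcessStartInfo
            {
                FileName = @"d:\EventStore\EventStore.ClusterNode.exe", // placeholder path
                Arguments = "--config d:\\Store\\node-config.tmp",      // placeholder config
                UseShellExecute = false
            },
            EnableRaisingEvents = true
        };
        _node.Exited += (s, e) => { if (!_stopping) Start(); };  // naive restart on crash
        _node.Start();
    }

    public void Stop()
    {
        _stopping = true;
        if (_node != null && !_node.HasExited) _node.Kill();
    }
}

public static class Program
{
    public static void Main()
    {
        HostFactory.Run(x =>
        {
            x.Service<EventStoreNodeHost>(s =>
            {
                s.ConstructUsing(name => new EventStoreNodeHost());
                s.WhenStarted(h => h.Start());
                s.WhenStopped(h => h.Stop());
            });
            x.RunAsLocalSystem();
            x.SetServiceName("EventStoreNode");   // placeholder service name
        });
    }
}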

By the way, the same issue happened yesterday. I created the same projection on a database containing ~5,000,000 streams and ~20 GB of data. The projection was working on approximately 2,000,000 of those streams, filtered by category. When it was about 90% complete, the master node went down and a chunk became corrupted; it stayed corrupted even after the projection completed on a new master node.
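
For reference, the projection has roughly this shape (a simplified sketch, not the exact query; the category, target stream and event type are placeholders):

static class ProjectionSketch
{
    // Simplified sketch of the projection query; registered as a continuous
    // projection with emit enabled (via the web UI or the projections HTTP API).
    public const string Query = @"
        fromCategory('banner')
            .when({
                $any: function (state, ev) {
                    // emit a new event for every event seen in the category
                    emit('banner-index', 'BannerEventIndexed', { source: ev.streamId });
                }
            });";
}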

Can you send the log of the first master node?

Also, when you say virtualized disks, are these disks by chance running over a network (e.g. a SAN)?

[PID:05348:036 2014.07.02 01:19:49.122 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [Worker #5]: HttpSend - 312ms. Q: 0/1.

[PID:01440:001 2014.07.02 05:23:44.544 INFO ProgramBase`1 ]

What happened in the 4 hours in between? Is something monitoring the process to start it back up again?

Sorry, please disregard this issue. Since the EventStore logs show a different time than the server time (-3 hours, probably UTC), the node was in fact started manually at that time: it was already 8:23 AM local time. I will send the logs privately.

What I meant to say is that the node went down nevertheless; it was just brought back manually.

What happened on the node between 11:18 and 11:20 (the restart)?

Can you send me your command line/configs as well?

Here are some examples of slow disks causing issues.

[PID:03724:015 2014.07.02 00:00:37.907 TRACE InMemoryBus ] SLOW BUS MSG [bus]: WriteEventsCompleted - 1015ms. Handler: RequestResponseDispatcher`2.

[PID:03724:015 2014.07.02 00:00:37.907 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projection Core #1]: WriteEventsCompleted - 1015ms. Q: 36/37.

[PID:03724:009 2014.07.02 00:00:37.939 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 1046ms. Q: 0/2.

[PID:03724:014 2014.07.02 00:00:37.939 TRACE InMemoryBus ] SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 1046ms. Handler: UnwrapEnvelopeHandler.

[PID:03724:014 2014.07.02 00:00:37.939 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projection Core #0]: UnwrapEnvelopeMessage - 1046ms. Q: 0/143.

[PID:03724:037 2014.07.02 00:00:37.939 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue #3]: ReadAllEventsForward - 1046ms. Q: 0/8.

That’s 1 second to do a write and 1 second to do a (small) read.

We are trying to fill the database again and sometimes these errors pop up:

[PID:02292:012 2014.07.04 13:16:53.309 ERROR QueuedHandlerMRES ] Error while processing message EventStore.Core.Messages.StorageMessage+RequestCompleted in queued handler ‘MainQueue’.

System.InvalidOperationException: Should never complete request twice.

at EventStore.Core.Services.RequestManager.RequestManagementService.Handle(RequestCompleted message) in c:\data\EventStore\src\EventStore\EventStore.Core\Services\RequestManager\RequestManagementService.cs:line 129

at EventStore.Core.Bus.MessageHandler`1.TryHandle(Message message) in c:\data\EventStore\src\EventStore\EventStore.Core\Bus\MessageHandler.cs:line 60

at EventStore.Core.Bus.InMemoryBus.Publish(Message message) in c:\data\EventStore\src\EventStore\EventStore.Core\Bus\InMemoryBus.cs:line 351

at EventStore.Core.Services.VNode.VNodeFSM.Handle(Message message) in c:\data\EventStore\src\EventStore\EventStore.Core\Services\VNode\VNodeFSM.cs:line 142

at EventStore.Core.Bus.QueuedHandlerMRES.ReadFromQueue(Object o) in c:\data\EventStore\src\EventStore\EventStore.Core\Bus\QueuedHandlerMRES.cs:line 148

Why would they occur?

Can you forward me the rest of your log?

That error is a false positive; I am pushing a small change regarding it now. It can only happen in a well-timed retrying scenario, and from your previous logs you were retrying a lot, so it seems reasonable that you received it. I should have the change pushed within a few minutes.

Actually I take that back. Researching more.