Corrupted chunk file in EventStore cluster

Hi All,

We are running an EventStore cluster with 3 nodes (release-oss-3.0.1) on Windows Server 2008.

Recently the cluster stopped unexpectedly. From the error log, it seems there is an issue with chunk file #14.

Is there any limit on the size of chunk files that can be replicated?

These are the detailed errors:

[PID:04348:046 2015.05.15 02:26:07.297 ERROR TcpConnectionManager] Bad request received from ‘master-secure’ [10.10.10.104:2116, L10.10.10.103:55417, {795c1d1c-1985-4229-a534-d0a93bce7fbe}], will stop server. CorrelationId: {deb47e11-f702-4a31-9a3e-eda768c5793d}, Error: Exception during subscribing replica. Connection will be dropped. Error: Log record at actual pos 206719563 has too large length: 1869770786 bytes, while limit is 16777216 bytes. In chunk #14-14 (chunk-000014.000000)…

[PID:04348:013 2015.05.15 02:26:07.671 ERROR QueuedHandlerMRES ] Error while processing message EventStore.Projections.Core.Messages.ReaderSubscriptionManagement+Unsubscribe in queued handler ‘Projection Core #0’.

System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.

at System.Collections.Generic.Dictionary`2.get_Item(TKey key)

at EventStore.Projections.Core.Services.Processing.EventReaderCoreService.Handle(Unsubscribe message) in c:\EventStore\src\EventStore.Projections.Core\Services\Processing\EventReaderCoreService.cs:line 132

at EventStore.Core.Bus.MessageHandler`1.TryHandle(Message message) in c:\EventStore\src\EventStore.Core\Bus\MessageHandler.cs:line 33

at EventStore.Core.Bus.InMemoryBus.Publish(Message message) in c:\EventStore\src\EventStore.Core\Bus\InMemoryBus.cs:line 324

at EventStore.Core.Bus.QueuedHandlerMRES.ReadFromQueue(Object o) in c:\EventStore\src\EventStore.Core\Bus\QueuedHandlerMRES.cs:line 121

[PID:04348:024 2015.05.15 02:26:08.498 ERROR TcpConnectionManager] Bad request received from ‘master-secure’ [10.10.10.104:2116, L10.10.10.103:55419, {2ee5d42f-c3a5-43a4-86c6-93fa0115b6bb}], will stop server. CorrelationId: {c08ef5c0-38b4-480e-9fbe-431b543eb694}, Error: Exception during subscribing replica. Connection will be dropped. Error: Log record at actual pos 206719563 has too large length: 1869770786 bytes, while limit is 16777216 bytes. In chunk #14-14 (chunk-000014.000000)…

[PID:04348:065 2015.05.15 02:26:09.340 ERROR TcpConnectionManager] Bad request received from ‘master-secure’ [10.10.10.104:2116, L10.10.10.103:55420, {8507b252-4d66-4ec8-92bd-a8180d887365}], will stop server. CorrelationId: {5dff3ba3-7779-4ace-8410-39bc48cc5cac}, Error: Exception during subscribing replica. Connection will be dropped. Error: Log record at actual pos 206719563 has too large length: 1869770786 bytes, while limit is 16777216 bytes. In chunk #14-14 (chunk-000014.000000).

Looking forward to your help.

Thanks in advance.

Grab chunk 14 on the master and send it to us to take a look at. In the
meantime, I imagine that node went down and another node became master;
is this correct?

Provided this is correct, an easy answer is to delete the db on that
node and it will re-replicate from the other nodes (at 14 chunks I'm
guessing this is about 3.5 GB, so it should be pretty quick as well).

Thanks for your response.

Each node in the cluster is wrapped in a Windows service.

In the wrapper, we killed the process directly instead of posting a request to http://localhost:2113/admin/shutdown.

Could this cause the chunk file to become corrupted?
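
As a follow-up, we are considering changing the wrapper so it first asks the node to shut down cleanly over HTTP before stopping the process. A rough sketch of what we have in mind is below; it assumes the node's external HTTP port is 2113 and the default admin credentials (admin/changeit), and the class itself is ours, not part of EventStore:

// Hypothetical wrapper step: request an orderly shutdown over HTTP instead
// of killing the EventStore process.
// Assumptions: external HTTP port 2113, default admin credentials.
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

class GracefulShutdown
{
    static void Main()
    {
        using (var client = new HttpClient())
        {
            // The /admin/shutdown endpoint requires an authenticated admin user.
            var credentials = Convert.ToBase64String(Encoding.ASCII.GetBytes("admin:changeit"));
            client.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Basic", credentials);

            // POST with an empty body; the node then begins an orderly shutdown.
            var response = client.PostAsync("http://localhost:2113/admin/shutdown", null).Result;
            Console.WriteLine("Shutdown request returned HTTP " + (int)response.StatusCode);
        }
    }
}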

It seems the limit is defined in TFConsts.MaxLogRecordSize (GitHub: link).

Could you also give an explanation of this?
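
In the meantime, this is roughly the pre-append size check we are thinking of adding on our side. The 16 MiB constant mirrors the value we see in TFConsts.MaxLogRecordSize, but the helper itself is only an illustration (it ignores the log record's own header overhead) and is not part of the EventStore client API:

// Hypothetical client-side guard: refuse payloads that would exceed the
// server's 16 MiB log record limit before appending.
using System;
using EventStore.ClientAPI;

static class SafeAppend
{
    const int MaxLogRecordSize = 16 * 1024 * 1024; // 16 MiB, mirroring TFConsts.MaxLogRecordSize

    public static void GuardedAppend(IEventStoreConnection connection, string stream, EventData evt)
    {
        // Rough check only: data plus metadata, ignoring record header overhead.
        var approximateSize = evt.Data.Length + (evt.Metadata == null ? 0 : evt.Metadata.Length);
        if (approximateSize >= MaxLogRecordSize)
            throw new ArgumentException("Event payload is too large for a single log record.");

        connection.AppendToStreamAsync(stream, ExpectedVersion.Any, evt).Wait();
    }
}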

On Friday, May 22, 2015 at 7:36:08 PM UTC+8, Greg Young wrote:

EventStore is designed to be pkilled, so while it's possible, I doubt that's it. Check your disk properties -> policies. Is 'Enable write caching on the device' checked? If so, it should not be.

That's a big one... but my guess is there are also levels of
virtualization in between?

The max size there shouldn't be a problem unless, of course, you
were actually writing 1.8 GB events?

Hi there,

The corruption happened again, on chunk #9; it seems all the events in this chunk file are internal $stats-xxxx system events.

I have put the chunk file on SendSpace: https://www.sendspace.com/file/kj7b9k

Please take a look at this file for us.

Thanks in advance.

On Friday, May 22, 2015 at 8:32:14 PM UTC+8, Greg Young wrote:

Are you doing crash/power outage testing etc? Was this a bad read or a
checksum fail? (can't download it now on a crappy wifi)

Thanks for your reply.

No, we didn’t do that kind of testing.

From the error logs, it seems the error is not related to the checksum.

More detailed error logs (68 MB): https://www.sendspace.com/file/3e201i

On 3.0.1? Can you do me a favor and try the test (from a fresh db with 3.0.5)?

I think this might be an edge condition that has since been fixed, where
very large events (>16 MB) were being allowed through on one write path
(i.e. a missing too-large check).

Cheers,

Greg

Sure.

Could you please explain more about the testing details? How can we simulate the edge condition?

What do you mean by “one path”? Is it a single EventStore connection?

I compared the two releases and got the list of commits between oss-v3.0.1 and oss-v3.0.5.

https://github.com/EventStore/EventStore/compare/cd14cf8c352d56690142ad46996d21f3e567ca2c...a9a82d955a0ac1db548b1873fa3be66923dc6032

So which commit or issue were you referring to that might have fixed this issue?

Thanks

On Thursday, June 11, 2015 at 9:18:52 PM UTC+8, Greg Young wrote:

I was guessing, as I seemed to remember a related issue. If you
can send a (running) reproduction, that would be great.

We have a cluster running with 3 EventStore instances on different machines.
Each instance has around 7 GB of chunk files.

Do you need the chunk files, or just a running instance without the database?

Ideally it would be better not to have to deal with a 7 GB db. Some
reproducible code would be useful (e.g. "run this and it will produce
the case"). There was, however, an issue fixed a while ago, though I
don't remember in which version, that sounds quite similar to this if
you were writing events bigger than 16 MB (basically there was a place
where this wasn't being properly stopped from happening).
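
For example, something along these lines is the kind of thing I mean. It is only a sketch (the port, stream name, and payload size are placeholders), run against a fresh single node first on 3.0.1 and then on 3.0.5 to compare behaviour:

// Hypothetical repro sketch: try to append an event just over the 16 MiB
// record limit against a fresh node and see whether it is rejected cleanly.
// Assumes a single node on the default TCP port 1113.
using System;
using System.Net;
using EventStore.ClientAPI;

class OversizedEventRepro
{
    static void Main()
    {
        var connection = EventStoreConnection.Create(new IPEndPoint(IPAddress.Loopback, 1113));
        connection.ConnectAsync().Wait();

        // ~17 MB payload, just above the 16 MiB log record limit.
        var payload = new byte[17 * 1024 * 1024];
        var evt = new EventData(Guid.NewGuid(), "oversized-test", false, payload, new byte[0]);

        try
        {
            connection.AppendToStreamAsync("repro-oversized", ExpectedVersion.Any, evt).Wait();
            Console.WriteLine("Append succeeded - the server accepted an oversized event.");
        }
        catch (AggregateException ex)
        {
            Console.WriteLine("Append rejected: " + ex.InnerException.Message);
        }

        connection.Close();
    }
}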

Hi Greg,

We have some new findings about this issue.

There are two EventStore clusters (C1, C2) running in parallel on these 3 nodes.

C1 and C2 serve different environments of the same application. This data corruption issue happened on the C1 cluster.

We found that one of C2's instances was incorrectly configured: an instance from C1 was listed in its gossip seeds.

However, we failed to reproduce this issue in a local environment by simulating the same cluster configuration.

Might this be the root cause of the chunk file corruption, and how can we reproduce it?

Thanks

On Friday, June 12, 2015 at 8:35:01 PM UTC+8, Greg Young wrote:

Can you be more clear about how it was misconfigured?