There are 3 EventStore nodes running on different VMs in a cluster.
There are 2 writers that write events through a cluster DNS. They append events directly to streams using WriteResult EventStore.ClientAPI.IEventStoreConnection.AppendToStream(string stream, int expectedVersion, IEnumerable events, UserCredentials userCredentials = null
)
A single EventDispatcher connects to a cluster DNS and subscribes to all streams.
I left the events writing through the night and I can see that 2 nodes crashed. The first node crashed on chunk 23, so I guess that did not cause the corruption because the dispatcher read past that node successfully.
NODE 1:
[PID:04048:012 2014.07.01 23:26:14.826 INFO ClusterVNodeControll] ========== [172.22.3.75:2113] IS CATCHING UP!!! BANZAI!!! MASTER IS [172.22.3.76:2113,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}]
[PID:04048:012 2014.07.01 23:26:14.842 INFO ClusterVNodeControll] ========== [172.22.3.75:2113] CLONE ASSIGNMENT RECEIVED FROM [172.22.3.76:1111,n/a,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}].
[PID:04048:012 2014.07.01 23:26:14.842 INFO ClusterVNodeControll] ========== [172.22.3.75:2113] IS CLONE!!! SPARTA!!! MASTER IS [172.22.3.76:2113,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}]
[PID:04048:012 2014.07.01 23:26:14.842 INFO ClusterVNodeControll] ========== [172.22.3.75:2113] SLAVE ASSIGNMENT RECEIVED FROM [172.22.3.76:1111,n/a,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}].
[PID:04048:012 2014.07.01 23:26:14.842 INFO ClusterVNodeControll] ========== [172.22.3.75:2113] IS SLAVE!!! SPARTA!!! MASTER IS [172.22.3.76:2113,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}]
[PID:04048:009 2014.07.01 23:26:14.888 INFO ClusterStorageWriter] === SUBSCRIBED to [172.22.3.76:1111,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}] at 6432480148 (0x17F67DB94). SubscriptionId: {070a1305-9d61-4c00-b64f-b25d3864fb78}.
[PID:04048:009 2014.07.01 23:26:14.888 INFO ClusterStorageWriter] Master [172.22.3.76:1111,{3d6b04ec-7c96-48d3-a8bd-27c181a16990}] subscribed us at 6432480148 (0x17F67DB94), which is less than our writer checkpoint 6433545594 (0x17F781D7A). TRUNCATION IS NEEDED!
[PID:04048:009 2014.07.01 23:26:14.888 INFO ClusterStorageWriter] OFFLINE TRUNCATION IS NEEDED (SubscribedAt 6432480148 (0x17F67DB94) <= LastCommitPosition 6433543849 (0x17F7816A9)). SHUTTING DOWN NODE.
[PID:04048:012 2014.07.01 23:26:14.888 INFO ClusterVNodeControll] ========== [172.22.3.75:2113] IS SHUTTING DOWN!!! FAREWELL, WORLD…
[PID:04048:009 2014.07.01 23:26:14.919 TRACE StorageWriterService] Blocking message DataChunkBulk in StorageWriterService. Message:
DataChunkBulk message: MasterId: 3d6b04ec-7c96-48d3-a8bd-27c181a16990, SubscriptionId: 070a1305-9d61-4c00-b64f-b25d3864fb78, ChunkStartNumber: 23, ChunkEndNumber: 23, SubscriptionPosition: 6432480148, DataBytes length: 782, CompleteChunk: False
At that point NODE 2 was or became the master. However, at one point something weird happened with NODE 3:
[PID:05348:029 2014.07.02 01:01:09.514 TRACE TFChunk ] UNCACHED TFChunk #30-30 (chunk-000030.000000).
[PID:05348:036 2014.07.02 01:19:49.122 TRACE InMemoryBus ] SLOW BUS MSG [Worker #5 Bus]: HttpSend - 312ms. Handler: HttpSendService.
[PID:05348:036 2014.07.02 01:19:49.122 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [Worker #5]: HttpSend - 312ms. Q: 0/1.
[PID:01440:001 2014.07.02 05:23:44.544 INFO ProgramBase`1 ]
ES VERSION: 3.0.0.0 (master/30f7fa64d73ba65028e5a6ca720639985cac1458, Tue, 18 Mar 2014 20:03:41 +0000)
OS: Windows (Microsoft Windows NT 6.2.9200.0)
RUNTIME: .NET 4.0.30319.18046 (64-bit)
GC: 3 GENERATIONS
LOGS: d:\Store\logs\cluster
SHOW HELP: False ()
SHOW VERSION: False ()
LOGS DIR: d:\Store\logs\cluster (logsdir in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
CONFIGS: C:\Windows\TEMP\tmp5D07.tmp (–config from command line)
DEFINES: ()
INTERNAL IP: 172.22.3.77 (internalIp in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
EXTERNAL IP: 172.22.3.77 (externalIp in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
INTERNAL HTTP PORT: 2113 (internalHttpPort in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
EXTERNAL HTTP PORT: 2114 (externalHttpPort in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
INTERNAL TCP PORT: 1111 (internalTcpPort in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
INTERNAL SECURE TCP PORT: 0 ()
EXTERNAL TCP PORT: 1112 (externalTcpPort in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
EXTERNAL SECURE TCP PORT: 0 ()
FORCE: False ()
CLUSTER SIZE: 3 (clusterSize in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
MIN FLUSH DELAY MS: 2 ()
NODE PRIORITY: 0 ()
COMMIT COUNT: 2 ()
PREPARE COUNT: 2 ()
MAX MEM TABLE SIZE: 1000000 ()
DISCOVER VIA DNS: True (useDnsDiscovery in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
CLUSTER DNS: eventstore.dev3.adform.com (clusterDns in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
CLUSTER GOSSIP PORT: 2113 (clusterGossipPort in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
GOSSIP SEEDS: ()
STATS PERIOD SEC: 30 ()
CACHED CHUNKS: -1 ()
CHUNKS CACHE SIZE: 536871424 ()
DB PATH: d:\Store\Banners (db in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
IN MEM DB: False (memDb in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
SKIP DB VERIFY: False ()
RUN PROJECTIONS: All (runProjections in config at ‘C:\Windows\TEMP\tmp5D07.tmp’)
PROJECTION THREADS: 3 ()
WORKER THREADS: 5 ()
HTTP PREFIXES: ()
ENABLE TRUSTED AUTH: False ()
CERTIFICATE STORE: ()
CERTIFICATE NAME: ()
CERTIFICATE FILE: ()
CERTIFICATE PASSWORD: ()
USE INTERNAL SSL: False ()
SSL TARGET HOST: n/a ()
SSL VALIDATE SERVER: True ()
AUTHENTICATION TYPE: internal ()
AUTHENTICATION CONFIG FILE: ()
PREPARE TIMEOUT MS: 2000 ()
COMMIT TIMEOUT MS: 2000 ()
DISABLE SCAVENGE MERGING: False ()
GOSSIP ON EXT: True ()
STATS ON EXT: True ()
ADMIN ON EXT: True ()
[PID:01440:001 2014.07.02 05:23:44.856 INFO ProgramBase`1 ] Quorum size set to 2
It looks like it had crashed and then restarted, but there are no errors or events in Windows Event Log. So, I don’t think there was any system failure.
I don’t see any weird behavior while working with chunks #25-25 or #26-26 though.