I’m currently trying to run a 3-node cluster across multiple machines using Docker, with the following command lines:
docker run --rm --name eventstore -e DB=db -e LOG=log -e PORT=2113 --net=host docker-eventstore --cluster-size=3 --gossip-seed=10.141.3.250:2112,10.141.5.186:2112 --discover-via-dns=false --ext-ip=0.0.0.0 --int-ip=10.141.4.162
docker run --rm --name eventstore -e DB=db -e LOG=log -e PORT=2113 --net=host docker-eventstore --cluster-size=3 --gossip-seed=10.141.4.162:2112,10.141.5.186:2112 --discover-via-dns=false --ext-ip=0.0.0.0 --int-ip=10.141.3.250
docker run --rm --name eventstore -e DB=db -e LOG=log -e PORT=2113 docker-eventstore --cluster-size=3 --gossip-seed=10.141.3.250:2112,10.141.4.162:2112 --discover-via-dns=false --ext-ip=0.0.0.0 --int-ip=10.141.5.186
Sometimes the nodes seem to sync properly for a few minutes before one of them dies with a seemingly random crash. Here are some of the crashes I’ve seen in the logs so far:
(On a slave)
[00001,10,17:48:25.946] ========== [10.141.4.162:2112] SLAVE ASSIGNMENT RECEIVED FROM [10.141.3.250:1112,n/a,{31d1180b-8beb-4e29-8bf5-a1f237ddee92}].
[00001,10,17:48:25.946] ========== [10.141.4.162:2112] IS SLAVE!!! SPARTA!!! MASTER IS [10.141.3.250:2112,{31d1180b-8beb-4e29-8bf5-a1f237ddee92}]
[00001,12,17:48:26.013] Error while processing message in queued handler 'Projection Core #0'.
Object reference not set to an instance of an object
[00001,12,17:48:26.018] Error while processing message in queued handler 'Projection Core #0'.
Object reference not set to an instance of an object
[00001,12,17:48:26.058] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
[ERROR] FATAL UNHANDLED EXCEPTION: System.NullReferenceException: Object reference not set to an instance of an object
at EventStore.Core.Bus.QueuedHandlerAutoReset.ReadFromQueue (System.Object o) [0x00000] in <filename unknown>:0
at System.Threading.Thread.StartInternal () [0x00000] in <filename unknown>:0
(On the master)
[00001,10,17:48:28.923] ELECTIONS: (V=6) DONE. ELECTED MASTER = 10.141.3.250:2112,{31d1180b-8beb-4e29-8bf5-a1f237ddee92}. ME=10.141.3.250:2112,{31d1180b-8beb-4e29-8bf5-a1f237ddee92}.
[00001,07,17:48:28.924] === Writing E1@2370:{77ef265f-54a6-4323-9cdf-924ac4bcd352} (previous epoch at 0).
[00001,21,17:48:28.943] Internal TCP connection accepted: [Normal, 10.141.5.186:50245, L10.141.3.250:1112, {5b390fca-40ed-4c79-b425-60325ef8de8f}].
[00001,07,17:48:28.951] === Update Last Epoch E1@2370:{77ef265f-54a6-4323-9cdf-924ac4bcd352} (previous epoch at 0).
[00001,10,17:48:29.007] SUBSCRIBE REQUEST from [10.141.5.186:1112,C:{5b390fca-40ed-4c79-b425-60325ef8de8f},S:{b6ae3ed5-732a-4a7c-bdf6-6ff797c29bf4},0(0x0),]…
[00001,10,17:48:29.007] Subscribed replica [10.141.5.186:1112,S:b6ae3ed5-732a-4a7c-bdf6-6ff797c29bf4] for data send at 0 (0x0).
no object of size 974521624
Stacktrace:
at <0xffffffff>
at (wrapper managed-to-native) object.icall_wrapper_mono_object_new_fast (intptr) <0xffffffff>
at EventStore.Core.Helpers.IODispatcherAsync/c__AnonStorey0.<>m__0 (System.Collections.Generic.IEnumerator`1<EventStore.Core.Helpers.IODispatcherAsync/Step>) <0x0002f>
at EventStore.Core.Helpers.IODispatcherAsync.Run (System.Collections.Generic.IEnumerator`1<EventStore.Core.Helpers.IODispatcherAsync/Step>) <0x00044>
at EventStore.Core.Helpers.IODispatcherAsync/c__AnonStorey6/c__AnonStorey7.<>m__0 (EventStore.Core.Helpers.IODispatcherDelayedMessage) <0x00084>
at EventStore.Core.Messaging.RequestResponseDispatcher`2.Handle (TResponse) <0x0010a>
at EventStore.Core.Messaging.RequestResponseDispatcher`2.EventStore.Core.Bus.IHandle.Handle (TResponse) <0x00019>
at EventStore.Core.Bus.MessageHandler`1.TryHandle (EventStore.Core.Messaging.Message) <0x000b1>
at EventStore.Core.Bus.InMemoryBus.Publish (EventStore.Core.Messaging.Message) <0x0010c>
at EventStore.Core.Bus.InMemoryBus.Handle (EventStore.Core.Messaging.Message) <0x00019>
at EventStore.Core.Bus.QueuedHandlerAutoReset.ReadFromQueue (object) <0x0022c>
at System.Threading.Thread.StartInternal () <0x0009b>
at (wrapper runtime-invoke) object.runtime_invoke_void__this (object,intptr,intptr,intptr) <0xffffffff>
Native stacktrace:
./clusternode() [0x612962]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfc90) [0x7f6b906cac90]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37) [0x7f6b9032de37]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f6b9032f528]
./clusternode() [0x570ab9]
./clusternode() [0x570cbf]
./clusternode() [0x570d62]
./clusternode() [0x51a858]
./clusternode() [0x51becb]
./clusternode() [0x52228e]
./clusternode() [0x522f42]
./clusternode() [0x51ea81]
./clusternode() [0x50cea7]
./clusternode() [0x50efc3]
./clusternode() [0x516f2a]
./clusternode() [0x517929]
./clusternode() [0x50c273]
./clusternode() [0x50c345]
[0x419d2f93]
Is there something obvious I’m doing wrong here? Sometimes the crashes happen right away, and sometimes it takes up to 10 minutes before they start.
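In case the seed lists matter: the intent of the commands at the top is that each node's --gossip-seed names the other two nodes on port 2112. A minimal sketch of that rule is below. The NODES variable and the gossip_seeds function are hypothetical helpers for illustration only, not part of Event Store or the docker image; the IPs and port are the ones from my commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only: derives each node's
# --gossip-seed value from a single list of internal IPs.
NODES="10.141.3.250 10.141.4.162 10.141.5.186"
GOSSIP_PORT=2112   # the port used in --gossip-seed in the commands above

# Print a comma-separated list of every node except the one given as $1.
gossip_seeds() {
  local self="$1" seeds="" ip
  for ip in $NODES; do
    [ "$ip" = "$self" ] && continue
    seeds="${seeds:+$seeds,}$ip:$GOSSIP_PORT"
  done
  printf '%s\n' "$seeds"
}

# Show the per-node flags so they can be compared against the real commands.
for ip in $NODES; do
  printf -- '--int-ip=%s --gossip-seed=%s\n' "$ip" "$(gossip_seeds "$ip")"
done
```

Comparing the printed lines with the three docker run commands is a quick way to rule out a typo in the seed lists; it says nothing about the other flags.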
Thanks!