Dockerized cluster nodes crashing randomly after several minutes of uptime

I’m currently trying to run a 3-node cluster across multiple machines using Docker with the following command lines:
docker run --rm --name eventstore -e DB=db -e LOG=log -e PORT=2113 --net=host docker-eventstore --cluster-size=3 --gossip-seed=10.141.3.250:2112,10.141.5.186:2112 --discover-via-dns=false --ext-ip=0.0.0.0 --int-ip=10.141.4.162

docker run --rm --name eventstore -e DB=db -e LOG=log -e PORT=2113 --net=host docker-eventstore --cluster-size=3 --gossip-seed=10.141.4.162:2112,10.141.5.186:2112 --discover-via-dns=false --ext-ip=0.0.0.0 --int-ip=10.141.3.250

docker run --rm --name eventstore -e DB=db -e LOG=log -e PORT=2113 docker-eventstore --cluster-size=3 --gossip-seed=10.141.3.250:2112,10.141.4.162:2112 --discover-via-dns=false --ext-ip=0.0.0.0 --int-ip=10.141.5.186

Sometimes the nodes seem to sync properly for a few minutes before one of them exits with a seemingly random crash. Some of the crashes I’ve seen in the logs so far are:
(On a slave)
[00001,10,17:48:25.946] ========== [10.141.4.162:2112] SLAVE ASSIGNMENT RECEIVED FROM [10.141.3.250:1112,n/a,{31d1180b-8beb-4e29-8bf5-a1f237ddee92}].
[00001,10,17:48:25.946] ========== [10.141.4.162:2112] IS SLAVE!!! SPARTA!!! MASTER IS [10.141.3.250:2112,{31d1180b-8beb-4e29-8bf5-a1f237ddee92}]
[00001,12,17:48:26.013] Error while processing message in queued handler ‘Projection Core #0’.
Object reference not set to an instance of an object
[00001,12,17:48:26.018] Error while processing message in queued handler ‘Projection Core #0’.
Object reference not set to an instance of an object
[00001,12,17:48:26.058] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
[ERROR] FATAL UNHANDLED EXCEPTION: System.NullReferenceException: Object reference not set to an instance of an object
at EventStore.Core.Bus.QueuedHandlerAutoReset.ReadFromQueue (System.Object o) [0x00000] in :0
at System.Threading.Thread.StartInternal () [0x00000] in :0

(On master)
[00001,10,17:48:28.923] ELECTIONS: (V=6) DONE. ELECTED MASTER = 10.141.3.250:2112,{31d1180b-8beb-4e29-8bf5-a1f237ddee92}. ME=10.141.3.250:2112,{31d1180b-8beb-4e29-8bf5-a1f237ddee92}.
[00001,07,17:48:28.924] === Writing E1@2370:{77ef265f-54a6-4323-9cdf-924ac4bcd352} (previous epoch at 0).
[00001,21,17:48:28.943] Internal TCP connection accepted: [Normal, 10.141.5.186:50245, L10.141.3.250:1112, {5b390fca-40ed-4c79-b425-60325ef8de8f}].
[00001,07,17:48:28.951] === Update Last Epoch E1@2370:{77ef265f-54a6-4323-9cdf-924ac4bcd352} (previous epoch at 0).
[00001,10,17:48:29.007] SUBSCRIBE REQUEST from [10.141.5.186:1112,C:{5b390fca-40ed-4c79-b425-60325ef8de8f},S:{b6ae3ed5-732a-4a7c-bdf6-6ff797c29bf4},0(0x0),]…
[00001,10,17:48:29.007] Subscribed replica [10.141.5.186:1112,S:b6ae3ed5-732a-4a7c-bdf6-6ff797c29bf4] for data send at 0 (0x0).
no object of size 974521624

Stacktrace:

at <0xffffffff>
at (wrapper managed-to-native) object.icall_wrapper_mono_object_new_fast (intptr) <0xffffffff>
at EventStore.Core.Helpers.IODispatcherAsync/c__AnonStorey0.<>m__0 (System.Collections.Generic.IEnumerator`1<EventStore.Core.Helpers.IODispatcherAsync/Step>) <0x0002f>
at EventStore.Core.Helpers.IODispatcherAsync.Run (System.Collections.Generic.IEnumerator`1<EventStore.Core.Helpers.IODispatcherAsync/Step>) <0x00044>
at EventStore.Core.Helpers.IODispatcherAsync/c__AnonStorey6/c__AnonStorey7.<>m__0 (EventStore.Core.Helpers.IODispatcherDelayedMessage) <0x00084>
at EventStore.Core.Messaging.RequestResponseDispatcher`2.Handle (TResponse) <0x0010a>
at EventStore.Core.Messaging.RequestResponseDispatcher`2.EventStore.Core.Bus.IHandle.Handle (TResponse) <0x00019>
at EventStore.Core.Bus.MessageHandler`1.TryHandle (EventStore.Core.Messaging.Message) <0x000b1>
at EventStore.Core.Bus.InMemoryBus.Publish (EventStore.Core.Messaging.Message) <0x0010c>
at EventStore.Core.Bus.InMemoryBus.Handle (EventStore.Core.Messaging.Message) <0x00019>
at EventStore.Core.Bus.QueuedHandlerAutoReset.ReadFromQueue (object) <0x0022c>
at System.Threading.Thread.StartInternal () <0x0009b>
at (wrapper runtime-invoke) object.runtime_invoke_void__this (object,intptr,intptr,intptr) <0xffffffff>

Native stacktrace:

    ./clusternode() [0x612962]
    /lib/x86_64-linux-gnu/libpthread.so.0(+0xfc90) [0x7f6b906cac90]
    /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37) [0x7f6b9032de37]
    /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f6b9032f528]
    ./clusternode() [0x570ab9]
    ./clusternode() [0x570cbf]
    ./clusternode() [0x570d62]
    ./clusternode() [0x51a858]
    ./clusternode() [0x51becb]
    ./clusternode() [0x52228e]
    ./clusternode() [0x522f42]
    ./clusternode() [0x51ea81]
    ./clusternode() [0x50cea7]
    ./clusternode() [0x50efc3]
    ./clusternode() [0x516f2a]
    ./clusternode() [0x517929]
    ./clusternode() [0x50c273]
    ./clusternode() [0x50c345]
    [0x419d2f93]

Is there something obvious I’m doing wrong here? Sometimes crashes happen right away, and sometimes they can take up to 10 minutes to start happening.

Thanks!

OS/ES version?

We’re currently running CoreOS 607.0.0!

Oh. And ES 3.0.3. Sent a bit too fast there.

I am guessing you are getting a SIGSEGV (not included). Basically,
from the stack trace, malloc is failing. This has nothing to do with
our code; it is due to library mismatches, code mismatches, very
low-level GC bugs, etc. (which are normally not platform dependent).
My guess would be a glibc mismatch, or whatever the issue is that we
have been seeing between Ubuntu 14.04 and 14.10 (e.g. something
mismatches and causes stack/heap corruption). I would also guess the
issue is not happening in the same place every time?

There are many people running 14.10 without issue, as far as I know.
That looks like the typical 14.04 issue (it's stack corruption).

Hey Ben,

Are you running the 3 containers on one Docker host or 1 each on 3 different hosts? I am trying to get a cluster up with Docker, but the IP address thing is clumsy. I can manually create 3 containers and grab the IPs, then get the cluster up… but the web interface gives a "Bad Request (Invalid host)" error and I see some errors in the logs about “Parameter name: driveName”.

I don’t mean to hijack the thread, but I’d love any advice on getting an ES cluster up with Docker.

Glenn

"but the web interface gives a Bad Request (Invalid host)": you are
most likely missing an http-prefix.

"Parameter name: driveName": a full log message would help

I got my issue fixed! Hilariously, it wasn’t an Ubuntu issue at all. Event Store seems to just be totally incompatible with CoreOS 607.0.0. Upgrading to the latest CoreOS fixed everything!

So, that did it. Adding the http-prefix, I was able to get to the web interface.

So, Ben, are you running across multiple machines? I am trying to figure out how I would deploy an ES cluster on something like EC2 Container Service. I have to know all the IP addresses ahead of time, which is problematic (I think).

Any thoughts? Am I making this too hard?

Yeah, we’re currently running across multiple instances. Rather than having to know all of the IPs ahead of time, what we’re looking to do eventually is use the DNS discovery flag and have each node register as it comes up, so the nodes can find each other automatically. I haven’t gotten to that point of the setup yet. I’m trying to make sure our prototype works end-to-end with the hardcoded IPs first, but that’ll be the next step.
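
For the hardcoded-IP stage, the per-node gossip seed lists do not have to be typed by hand. Below is a rough sketch that derives each node's `--gossip-seed` value (every node except itself) from a single list and prints the matching `docker run` line. The IPs, gossip port, image name, and flags are copied from the commands at the top of this thread; the function names `gossip_seeds` and `node_command` are made up for illustration, and the script only prints commands rather than running them.

```shell
#!/bin/sh
# Node IPs and gossip port as used in the commands earlier in this thread.
NODES="10.141.4.162 10.141.3.250 10.141.5.186"
GOSSIP_PORT=2112

# Hypothetical helper: print a comma-separated seed list of every node except $1.
gossip_seeds() {
    self="$1"
    seeds=""
    for ip in $NODES; do
        # A node does not list itself as a gossip seed.
        [ "$ip" = "$self" ] && continue
        # Prepend a comma only when the list is already non-empty.
        seeds="${seeds:+$seeds,}$ip:$GOSSIP_PORT"
    done
    printf '%s\n' "$seeds"
}

# Hypothetical helper: print (not execute) the docker command for node $1.
node_command() {
    printf 'docker run --rm --name eventstore -e DB=db -e LOG=log -e PORT=2113 --net=host docker-eventstore --cluster-size=3 --gossip-seed=%s --discover-via-dns=false --ext-ip=0.0.0.0 --int-ip=%s\n' \
        "$(gossip_seeds "$1")" "$1"
}

# Print one command per node so each can be reviewed before being run on its host.
for node in $NODES; do
    node_command "$node"
done
```

Generating the lines from one list at least keeps the seed lists and flags (such as `--net=host`) consistent across the three hosts. It still assumes the IPs are known up front, so it is only a stopgap until DNS discovery is in place.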