Dockerized cluster nodes crashing randomly after several minutes of uptime

Ben_Salem · May 12, 2015, 5:51pm

I’m currently trying to run a 3-node cluster across multiple machines using Docker, with the following command lines:
docker run --rm --name eventstore -e DB=db -e LOG=log -e PORT=2113 --net=host docker-eventstore --cluster-size=3 --gossip-seed=10.141.3.250:2112,10.141.5.186:2112 --discover-via-dns=false --ext-ip=0.0.0.0 --int-ip=10.141.4.162

docker run --rm --name eventstore -e DB=db -e LOG=log -e PORT=2113 --net=host docker-eventstore --cluster-size=3 --gossip-seed=10.141.4.162:2112,10.141.5.186:2112 --discover-via-dns=false --ext-ip=0.0.0.0 --int-ip=10.141.3.250

docker run --rm --name eventstore -e DB=db -e LOG=log -e PORT=2113 docker-eventstore --cluster-size=3 --gossip-seed=10.141.3.250:2112,10.141.4.162:2112 --discover-via-dns=false --ext-ip=0.0.0.0 --int-ip=10.141.5.186
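
For reference, the pattern in the three commands above is that each node lists the other two nodes as its gossip seeds and passes its own address as --int-ip. A minimal sketch of that pattern, assuming the same IPs and gossip port as above (the helper name `node_args` is hypothetical and only builds the flag list, it does not start anything):

```python
# Hypothetical helper that reproduces the per-node flags from the commands above.
NODE_IPS = ["10.141.4.162", "10.141.3.250", "10.141.5.186"]
GOSSIP_PORT = 2112


def node_args(int_ip, all_ips, gossip_port=GOSSIP_PORT):
    """Return the cluster flags for one node; its seeds are every node except itself."""
    seeds = ",".join(f"{ip}:{gossip_port}" for ip in all_ips if ip != int_ip)
    return [
        f"--cluster-size={len(all_ips)}",
        f"--gossip-seed={seeds}",
        "--discover-via-dns=false",
        "--ext-ip=0.0.0.0",
        f"--int-ip={int_ip}",
    ]


if __name__ == "__main__":
    # Print one flag line per node so they can be compared against each other.
    for ip in NODE_IPS:
        print(" ".join(node_args(ip, NODE_IPS)))
```

Generating the flags from a single list of IPs makes it easier to spot a node whose seed list accidentally includes itself or omits a peer.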

Sometimes they seem to sync properly for a few minutes before one of them dumps out with a seemingly random crash. Some of the ones I’ve seen so far in the logs are:
(on a slave)
[00001,10,17:48:25.946] ========== [10.141.4.162:2112] SLAVE ASSIGNMENT RECEIVED FROM [10.141.3.250:1112,n/a,{31d1180b-8beb-4e29-8bf5-a1f237ddee92}].
[00001,10,17:48:25.946] ========== [10.141.4.162:2112] IS SLAVE!!! SPARTA!!! MASTER IS [10.141.3.250:2112,{31d1180b-8beb-4e29-8bf5-a1f237ddee92}]
[00001,12,17:48:26.013] Error while processing message in queued handler ‘Projection Core #0’.
Object reference not set to an instance of an object
[00001,12,17:48:26.018] Error while processing message in queued handler ‘Projection Core #0’.
Object reference not set to an instance of an object
[00001,12,17:48:26.058] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
[ERROR] FATAL UNHANDLED EXCEPTION: System.NullReferenceException: Object reference not set to an instance of an object
at EventStore.Core.Bus.QueuedHandlerAutoReset.ReadFromQueue (System.Object o) [0x00000] in :0
at System.Threading.Thread.StartInternal () [0x00000] in :0

(On master)
[00001,10,17:48:28.923] ELECTIONS: (V=6) DONE. ELECTED MASTER = 10.141.3.250:2112,{31d1180b-8beb-4e29-8bf5-a1f237ddee92}. ME=10.141.3.250:2112,{31d1180b-8beb-4e29-8bf5-a1f237ddee92}.
[00001,07,17:48:28.924] === Writing E1@2370:{77ef265f-54a6-4323-9cdf-924ac4bcd352} (previous epoch at 0).
[00001,21,17:48:28.943] Internal TCP connection accepted: [Normal, 10.141.5.186:50245, L10.141.3.250:1112, {5b390fca-40ed-4c79-b425-60325ef8de8f}].
[00001,07,17:48:28.951] === Update Last Epoch E1@2370:{77ef265f-54a6-4323-9cdf-924ac4bcd352} (previous epoch at 0).
[00001,10,17:48:29.007] SUBSCRIBE REQUEST from [10.141.5.186:1112,C:{5b390fca-40ed-4c79-b425-60325ef8de8f},S:{b6ae3ed5-732a-4a7c-bdf6-6ff797c29bf4},0(0x0),]…
[00001,10,17:48:29.007] Subscribed replica [10.141.5.186:1112,S:b6ae3ed5-732a-4a7c-bdf6-6ff797c29bf4] for data send at 0 (0x0).
no object of size 974521624

Stacktrace:

at <0xffffffff>
at (wrapper managed-to-native) object.icall_wrapper_mono_object_new_fast (intptr) <0xffffffff>
at EventStore.Core.Helpers.IODispatcherAsync/c__AnonStorey0.<>m__0 (System.Collections.Generic.IEnumerator`1<EventStore.Core.Helpers.IODispatcherAsync/Step>) <0x0002f>
at EventStore.Core.Helpers.IODispatcherAsync.Run (System.Collections.Generic.IEnumerator`1<EventStore.Core.Helpers.IODispatcherAsync/Step>) <0x00044>
at EventStore.Core.Helpers.IODispatcherAsync/c__AnonStorey6/c__AnonStorey7.<>m__0 (EventStore.Core.Helpers.IODispatcherDelayedMessage) <0x00084>
at EventStore.Core.Messaging.RequestResponseDispatcher`2.Handle (TResponse) <0x0010a>
at EventStore.Core.Messaging.RequestResponseDispatcher`2.EventStore.Core.Bus.IHandle.Handle (TResponse) <0x00019>
at EventStore.Core.Bus.MessageHandler`1.TryHandle (EventStore.Core.Messaging.Message) <0x000b1>
at EventStore.Core.Bus.InMemoryBus.Publish (EventStore.Core.Messaging.Message) <0x0010c>
at EventStore.Core.Bus.InMemoryBus.Handle (EventStore.Core.Messaging.Message) <0x00019>
at EventStore.Core.Bus.QueuedHandlerAutoReset.ReadFromQueue (object) <0x0022c>
at System.Threading.Thread.StartInternal () <0x0009b>
at (wrapper runtime-invoke) object.runtime_invoke_void__this (object,intptr,intptr,intptr) <0xffffffff>

Native stacktrace:

    ./clusternode() [0x612962]
    /lib/x86_64-linux-gnu/libpthread.so.0(+0xfc90) [0x7f6b906cac90]
    /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37) [0x7f6b9032de37]
    /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f6b9032f528]
    ./clusternode() [0x570ab9]
    ./clusternode() [0x570cbf]
    ./clusternode() [0x570d62]
    ./clusternode() [0x51a858]
    ./clusternode() [0x51becb]
    ./clusternode() [0x52228e]
    ./clusternode() [0x522f42]
    ./clusternode() [0x51ea81]
    ./clusternode() [0x50cea7]
    ./clusternode() [0x50efc3]
    ./clusternode() [0x516f2a]
    ./clusternode() [0x517929]
    ./clusternode() [0x50c273]
    ./clusternode() [0x50c345]
    [0x419d2f93]

Is there something obvious I’m doing wrong here? Sometimes crashes happen right away, and sometimes they can take up to 10 minutes to start happening.

Thanks!

Greg_Young1 · May 12, 2015, 8:00pm

OS/ES version?

Ben_Salem · May 12, 2015, 8:23pm

We’re currently running CoreOS 607.0.0!

Ben_Salem · May 12, 2015, 8:24pm

Oh. And ES 3.0.3. Sent a bit too fast there.

Greg_Young1 · May 12, 2015, 8:52pm

I am guessing you are getting a SIGSEGV (not included in what you
posted). Basically, from the stack trace, malloc is failing. This has
nothing to do with our code; it is due to library mismatches, code
mismatches, very low-level GC bugs, etc. (which are normally not
platform dependent). My guess would be a glibc mismatch, or whatever
the issue is that we have been seeing between Ubuntu 14.04 and 14.10
(e.g. something mismatches and causes stack/heap corruption). I would
also guess the issue is not happening in the same place every time?
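
One way to look for the kind of mismatch described above is to record the kernel release and C library version on each host and inside each container, then compare them. A minimal sketch, assuming Python is available in the environment being checked; the function name `runtime_fingerprint` is hypothetical, and `platform.libc_ver()` reports the C library the Python interpreter itself is linked against, which may not be the one the clusternode binary uses:

```python
import platform


def runtime_fingerprint():
    """Collect the kernel release and C library version visible to this process."""
    # libc_ver() inspects the running Python executable; on non-glibc
    # systems it can return empty strings.
    libc_name, libc_version = platform.libc_ver()
    return {
        "kernel": platform.release(),
        "libc": f"{libc_name} {libc_version}".strip(),
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    for key, value in runtime_fingerprint().items():
        print(f"{key}: {value}")
```

Because a container shares the host's kernel but ships its own userland libraries, running this both on the host and inside the container shows which of the two layers differs between a working and a crashing machine.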

Greg_Young1 · May 12, 2015, 9:56pm

There are many people running 14.10 without issue as far as I know.
That looks like the typical 14.04 issue (it's stack corruption).

Glenn_Goodrich · May 13, 2015, 1:23pm

Hey Ben,

Are you running the 3 containers on one Docker host, or 1 each on 3 different hosts? I am trying to get a cluster up with Docker, but the IP address thing is clumsy. I can manually create 3 containers and grab the IPs, then get the cluster up… but the web interface gives a "Bad Request (Invalid host)" error and I see some errors in the logs about “Parameter name: driveName”.

I don’t mean to hijack the thread, but I’d love any advice on getting an ES cluster up with Docker.

Glenn

Greg_Young1 · May 13, 2015, 1:26pm

"but the web interface gives a bad request invalid host": you most
likely are missing an http-prefix.

"Parameter name: driveName": a full log message would help

Ben_Salem · May 13, 2015, 4:56pm

I got my issue fixed! Hilariously, it wasn’t an Ubuntu issue at all. Event Store seems to just be totally incompatible with CoreOS 607.0.0; upgrading to the latest CoreOS fixed everything!

Glenn_Goodrich · May 13, 2015, 7:07pm

So, that did it. After adding the http-prefix, I was able to get to the web interface.

So, Ben, are you running across multiple machines? I am trying to figure out how I would deploy an ES cluster on something like EC2 Container Service. I have to know all the IP addresses ahead of time, which is problematic (I think).

Any thoughts? Am I making this too hard?

Ben_Salem · May 13, 2015, 7:25pm

Yeah, we’re currently running across multiple instances. What we’re looking to do eventually, rather than having to know all of the IPs ahead of time, is use the DNS discovery flag and have each node register as it comes up, so the nodes can find each other automatically. I haven’t gotten to that point of the setup yet; I’m trying to make sure our prototype works end-to-end with the hardcoded IPs first, but that’ll be the next step.
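
To illustrate the idea behind DNS-based discovery: a single DNS name carries one A record per node, and the peer list is derived from whatever that name currently resolves to. The following is a rough sketch of that idea only, not how Event Store implements it internally. The DNS name `es.example.internal` and the function `gossip_seeds_from_dns` are made up for illustration; the port 2112 is taken from the commands earlier in the thread.

```python
import socket


def gossip_seeds_from_dns(cluster_dns, gossip_port, self_ip=None,
                          resolve=socket.gethostbyname_ex):
    """Resolve one DNS name to all of its A records and format them as ip:port seeds."""
    # gethostbyname_ex returns (hostname, alias_list, ip_address_list).
    _hostname, _aliases, addresses = resolve(cluster_dns)
    # Skip our own address so a node never lists itself as a seed.
    return [f"{ip}:{gossip_port}" for ip in sorted(addresses) if ip != self_ip]


if __name__ == "__main__":
    # Stand-in resolver so the example runs without a real DNS record.
    def fake_resolve(name):
        return (name, [], ["10.141.5.186", "10.141.3.250", "10.141.4.162"])

    seeds = gossip_seeds_from_dns("es.example.internal", 2112,
                                  self_ip="10.141.4.162", resolve=fake_resolve)
    print(",".join(seeds))
```

The resolver is passed in as a parameter so the logic can be exercised without network access; in a real deployment the default `socket.gethostbyname_ex` would query the actual record that nodes register under.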