Event Store server crashes all the time on Linux (Ubuntu 14.10)

I have run several tests to find out how stable Event Store is. My goal is to see whether the Event Store server can stay up until my test fills up the disk. In practice, the test crashes Event Store every time.

I am wondering whether there are configuration issues or known bugs causing these crashes. Are there any suggestions or measures I can take to prevent them? I am going to run the same test in a Windows environment, but I really do not want to go down that path even if Event Store works better on Windows.

Here are the details of my test.

Test Environment:

Event Store Server:

Three-node cluster in AWS

CoreOS: Linux kernel 4.0.1

CPU(s): 8
CPU Model name: Intel® Xeon® CPU E5-2666 v3 @ 2.90GHz

RAM: 16GB

Ubuntu 14.10 libraries (container)

AWS instance type: c4.2xlarge

Test 1

JMeter: TCP protocol

Batch Size: 5000

Function in test: writeEvent call only

Event Data Payload Size: 500 Bytes

Number of concurrent users (JMeter threads): 1

Targeted Test Duration: 72 hours (a rough sketch approximating this load follows below)
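For reference, here is a rough Python sketch that approximates the same write-only load. It is only an illustration, not the actual JMeter plan: the real test drove the TCP interface on port 1113, whereas this sketch goes through the HTTP API on port 2113, and the host address, stream name, and payload contents are assumptions.

# Rough approximation of the write-only load profile over the HTTP API.
# The actual test used JMeter against the TCP interface; the host, stream
# name, and payload contents below are assumptions for illustration only.
import json
import time
import uuid

import requests

ES_URL = "http://127.0.0.1:2113"        # assumed node address (external HTTP port)
STREAM = "loadtest-stream"              # hypothetical stream name
BATCH_SIZE = 5000                       # matches the JMeter batch size
PAYLOAD = {"filler": "x" * 480}         # roughly 500 bytes of event data
TARGET_DURATION_S = 72 * 3600           # mirrors the 72-hour target

def write_batch():
    events = [
        {
            "eventId": str(uuid.uuid4()),   # unique ID per event
            "eventType": "LoadTestEvent",
            "data": PAYLOAD,
        }
        for _ in range(BATCH_SIZE)
    ]
    resp = requests.post(
        f"{ES_URL}/streams/{STREAM}",
        data=json.dumps(events),
        headers={"Content-Type": "application/vnd.eventstore.events+json"},
        timeout=30,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    start = time.time()
    # Single writer in a loop, mirroring one JMeter thread.
    while time.time() - start < TARGET_DURATION_S:
        write_batch()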

Result 1

The Event Store server crashed after five minutes of the load test with the error message below. The error seems related to setting the "projection=all" flag in Event Store: https://github.com/EventStore/EventStore/issues/459

[00001,06,17:20:36.361] Connection 'external-normal' [52.7.113.16:55853, {e67e5bdc-306d-4ec8-b499-cab44d28eb47}] closed: ConnectionReset.
[00001,22,17:20:36.390] External TCP connection accepted: [Normal, 52.7.113.16:55856, L10.243.4.120:1113, {7904b60c-3a43-445c-83bf-2617bdae0db2}].
Stacktrace:

Native stacktrace:

./clusternode() [0x612962]
./clusternode() [0x5beb0b]
./clusternode() [0x4584f3]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfc90) [0x7f4dce6f4c90]

Debug info from gdb:

The third test tells you exactly what the problem is: offline truncation is needed, which requires you to restart the node. The commercial manager acts as a supervisor for this; there is no equivalent in the OSS packages, though.

Did you build from source or use the downloadable packages? Which version of libc do you have?

Result 2 also looks suspect: even though there is an error reported, it's handled. More likely all the nodes were shut down and, not being supervised, never restarted?

1) Can look at it; need to know more about the environment, version of libc, etc.

2) I "disabled" projections, then had an error with projections.

3) Pretty self-explanatory. Supervise the node. There are times when it's easier to have a process commit suicide and restart than to try to handle the edge condition (a minimal supervision sketch follows below).
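For the OSS packages, the simplest form of that supervision is a restart-on-exit wrapper (or systemd/upstart/runit doing the same job). A minimal sketch, assuming the node is launched as ./clusternode with whatever flags you normally pass; this is not the commercial manager, just an illustration of the idea:

# Minimal restart-on-exit supervisor sketch. The command line below is an
# assumption; substitute whatever you normally use to start the node.
import subprocess
import time

NODE_CMD = ["./clusternode"]        # plus your usual flags
RESTART_DELAY_SECONDS = 5

def supervise():
    while True:
        proc = subprocess.Popen(NODE_CMD)
        code = proc.wait()          # block until the node exits (e.g. needing truncation)
        print(f"clusternode exited with code {code}; "
              f"restarting in {RESTART_DELAY_SECONDS}s "
              "(truncation happens on the restart)")
        time.sleep(RESTART_DELAY_SECONDS)

if __name__ == "__main__":
    supervise()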

Hi All,

Solved the same problem by upgrading the kernel to 3.19: http://askubuntu.com/questions/598483/how-can-i-use-kernel-3-19-in-14-04-now

Regards,

Dmitry

I think for the next builds going forward we’ll default to building on the latest LTS version of Ubuntu.

James, can you explain more about the root cause of the truncation problem? Why is it an edge case that requires the node to commit suicide? Is online truncation a future feature on your roadmap?

Thanks,

Sam

Supervise the node. This is not uncommon.

That's not what I was asking. I look dimly upon process crashes, even when they are semi-graceful. The presence of a monitoring agent is not relevant. Brass tacks: I'm questioning the robustness of the server implementation and would like to know more. This seems a reasonable ask for an enterprise-grade product, yes?

-Sam

The situation is known as a deposed master. You had a master who is no longer the master (he was partitioned). During this time of partitioning he may have written transactions to his log, while he still thought he was master, that the others do not have (as they have moved on after elections). As such, this node's log needs to be truncated. A log truncation is an operation that's much better and safer to do on a restart than inside an active node (all sorts of funky race conditions; it's far more robust to clean-start at this point, and IIRC Raft specifies the same).

The commercial version comes with a supervisor.

Cheers,

Greg
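To make that concrete, here is a toy sketch of the scenario; the numbers and data structures are invented for illustration and are not Event Store's actual log or chunk format:

# Toy illustration of a deposed master. Positions are made-up log offsets.
deposed_master_log = [100, 200, 300, 350, 400]   # kept writing during the partition
surviving_nodes_logs = [
    [100, 200, 300],                             # new master after elections
    [100, 200, 300],                             # remaining slave
]

# The position the rest of the cluster agrees on (what the survivors share).
agreed_position = min(max(log) for log in surviving_nodes_logs)

# Entries the old master wrote beyond that point were never acknowledged by a
# majority, so its log must be truncated back before it rejoins as a slave.
to_truncate = [p for p in deposed_master_log if p > agreed_position]
kept = [p for p in deposed_master_log if p <= agreed_position]

print(f"agreed position: {agreed_position}")              # 300
print(f"unreplicated tail to truncate: {to_truncate}")    # [350, 400]
print(f"log after offline truncation: {kept}")            # [100, 200, 300]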

Because the writes were not acknowledged by a majority of nodes, the client gets an error in this situation, right?

The client will have received a timeout.

The writes may or may not have reached a majority of nodes. The client retries on timeouts and will receive an idempotent write if the original write reached a majority, or will actually write if it did not.

Greg
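As a sketch of what that looks like from the client's side: the important property is that event IDs are assigned once and reused on every retry, so a write that did reach a majority is recognized on retry instead of being duplicated. The append_to_stream function below is a hypothetical stand-in for whatever client call you actually use, not a real Event Store API:

# Timeout-retry sketch with stable event IDs. append_to_stream() is a
# hypothetical stand-in for your client's append call.
import time
import uuid

class WriteTimeout(Exception):
    pass

def append_to_stream(stream, events):
    """Hypothetical client call; raises WriteTimeout if no ack arrives in time."""
    raise NotImplementedError

def write_with_retry(stream, payloads, max_attempts=5):
    # Assign event IDs once, up front, and reuse them on every retry.
    events = [{"eventId": str(uuid.uuid4()), "data": p} for p in payloads]
    for attempt in range(1, max_attempts + 1):
        try:
            return append_to_stream(stream, events)
        except WriteTimeout:
            if attempt == max_attempts:
                raise
            # The write may or may not have reached a majority; retrying with
            # the same IDs is safe either way.
            time.sleep(min(2 ** attempt, 30))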

Actually, no, it doesn't seem reasonable to me, and the presence of a monitoring agent is highly relevant. See Erlang.

It's not a crash, just an exit. This is a perfectly good strategy for dealing with lots of conditions, as has been proven over the years. The reason you run clusters of commodity hardware is precisely to allow this kind of behaviour.

@Greg - Thanks for the explanation, that helps.

@James - I’m terribly sorry you feel that questions about root causes aren’t reasonable. You can quibble about semantics, but an unplanned exit is no different from a crash to an Ops person, as both generate an incident and require investigation. Only when Ops has sufficient understanding can they disregard exit events. To be clear, I’m not arguing for/against the suicide approach. I just want to know the why.

Related topic: I work with Tao, the original poster, and no master has survived more than 30 minutes under load before losing its heartbeat and dying. This is a distressingly high failure rate and causes a lot of latency spikes on the client. So my question is: is this the best we can expect from GES, or are we doing something wrong? For example, is this a side effect of AWS network jitter, and would we see better results on-prem?

-Sam

"Related topic: I work with Tao, the original poster. and no master
has survived more than 30 minutes under load before losing the
heartbeat and dying. This is a distressingly high failure rate and
causes a lot of latency spikes on the client. So my question is: is
this the best we can expect from GES? Or are we doing something
wrong? For example, is this side-effect of AWS network jitter, and
would we see better results on-prem?"

I would also be interested in seeing what the CPU usage, TCP settings, etc. were on the box.

I can gather that for you. I’ll also gather the OS/GES versions. Thanks again!

Another question: I've already opened a query with your enterprise support team about onsite training. If we book an onsite, can we use it to explore questions at this level of technical detail? That would be incredibly helpful.

Finally, I apologize if I’m coming across as overly demanding. We’re a conservative company, so I have to meet a very high bar for new software.

-Sam

Greg,

Ubuntu 14.10 containers on CoreOS beta 647.0.0 with ES version 3.0.3, in AWS.

Still arranging another perf run, but IIRC the CPU has been very underloaded so far.

What TCP settings are you looking for, exactly?

-Sam

OK, we're running another copy of the load test, and it looks like we're using about 20% of the available CPU on two machines (approximately 1.5 cores each) and 6-7% of the total CPU on the third machine (about 0.5 cores). As long as the process isn't CPU-locked (and I assume it isn't, given that it's using about 1.5 cores on a couple of machines), we should be totally fine on CPU.

We saw the crash again, and we captured a whole bunch of things happening in the second of the crash. Hope you can find something in the log below!

[00001,31,20:09:31.571] External TCP connection accepted: [Normal, 52.7.113.16:59890, L10.243.0.211:1113, {f4f1d01b-21ee-4cb2-b818-79570baa5bc2}].
[00001,31,20:09:36.029] SLOW BUS MSG [Worker #4 Bus]: TcpSend - 4210ms. Handler: TcpSendService.
[00001,21,20:09:36.029] SLOW BUS MSG [Worker #5 Bus]: TcpSend - 4210ms. Handler: TcpSendService.
[00001,31,20:09:36.029] SLOW QUEUE MSG [Worker #4]: TcpSend - 4211ms. Q: 715/715.
[00001,21,20:09:36.029] SLOW QUEUE MSG [Worker #5]: TcpSend - 4211ms. Q: 41/62.
[00001,15,20:09:36.029] SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 4222ms. Handler: UnwrapEnvelopeHandler.
[00001,26,20:09:36.029] PTables merge finished in 00:00:31.1839287 ([16004992, 16002064] entries merged into 32007056).
[00001,15,20:09:36.029] SLOW QUEUE MSG [Projection Core #2]: UnwrapEnvelopeMessage - 4222ms. Q: 61/313.
[00001,26,20:09:36.029] Loading PTable '7ab183a5-522b-4bf6-94df-632d3780d8c8' started...
[00001,19,20:09:36.029] SLOW QUEUE MSG [StorageReaderQueue #4]: ReadAllEventsForward - 4211ms. Q: 0/0.
[00001,13,20:09:36.029] SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 4212ms. Handler: UnwrapEnvelopeHandler.
[00001,13,20:09:36.029] SLOW QUEUE MSG [Projection Core #0]: UnwrapEnvelopeMessage - 4212ms. Q: 0/253.
[00001,29,20:09:36.031] ES TcpConnection closed [20:09:36.031: N10.243.8.160:37026, L10.243.0.211:1112, {255fbf12-86bd-4ddc-9d5f-0fa2fed666ca}]:
Received bytes: 2042130, Sent bytes: 21826478575
Send calls: 2941073, callbacks: 2941072
Receive calls: 40978, callbacks: 40978
Close reason: [Success] Socket closed

[00001,07,20:09:36.031] ES TcpConnection closed [20:09:36.031: N52.7.113.16:59890, L10.243.0.211:1113, {f4f1d01b-21ee-4cb2-b818-79570baa5bc2}]:
Received bytes: 2868990, Sent bytes: 95
Send calls: 3, callbacks: 2
Receive calls: 358, callbacks: 358
Close reason: [Success] Socket closed

[00001,27,20:09:36.031] ES TcpConnection closed [20:09:36.031: N10.243.4.120:50437, L10.243.0.211:1112, {572717b1-13b8-482e-8912-2aa2e4c8941e}]:
Received bytes: 13683, Sent bytes: 116019785
Send calls: 17087, callbacks: 17086
Receive calls: 253, callbacks: 253
Close reason: [Success] Socket closed

[00001,29,20:09:36.031] Connection 'internal-normal' [10.243.8.160:37026, {255fbf12-86bd-4ddc-9d5f-0fa2fed666ca}] closed: Success.
[00001,07,20:09:36.031] Connection 'external-normal' [52.7.113.16:59890, {f4f1d01b-21ee-4cb2-b818-79570baa5bc2}] closed: Success.
[00001,27,20:09:36.031] Connection 'internal-normal' [10.243.4.120:50437, {572717b1-13b8-482e-8912-2aa2e4c8941e}] closed: Success.
[00001,11,20:09:36.032] Looks like node [10.243.4.120:1112] is DEAD (TCP connection lost).
[00001,24,20:09:36.036] External TCP connection accepted: [Normal, 52.7.113.16:59891, L10.243.0.211:1113, {a14bd876-d5a8-48ee-a278-1a97a270cc30}].
[00001,11,20:09:36.036] CLUSTER HAS CHANGED (TCP connection lost to [10.243.4.120:1112])
Old:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184833770652/184836185263/184836185263/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:32.830
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} [Slave, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184833770652/184836536655/184836536655/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:32.261
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.032
New:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184833770652/184836185263/184836185263/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:32.830
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} [Slave, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184833770652/184836536655/184836536655/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.036
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.032

This is not a crash; just reading the log explains what happened. Truncation is needed. This will always happen to a master who is deposed.

I see, however, a "very slow message" of 4 seconds to process…
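To connect that to the frequent truncations: a stall of several seconds is more than enough to blow through a heartbeat timeout, at which point a peer gets declared dead (as in the log above), elections can follow, and the former master comes back deposed and needing truncation. A toy illustration; the timeout value is an assumption, not Event Store's actual default:

# Toy illustration: a multi-second stall versus an assumed heartbeat timeout.
ASSUMED_HEARTBEAT_TIMEOUT_MS = 1000     # assumption for illustration only
observed_stall_ms = 4210                # the SLOW BUS MSG durations in the log above

if observed_stall_ms > ASSUMED_HEARTBEAT_TIMEOUT_MS:
    missed = observed_stall_ms // ASSUMED_HEARTBEAT_TIMEOUT_MS
    print(f"A {observed_stall_ms}ms stall exceeds the {ASSUMED_HEARTBEAT_TIMEOUT_MS}ms "
          f"timeout (~{missed} heartbeat intervals missed): the peer is declared dead, "
          "elections run, and a deposed master must truncate on restart.")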

Sure, we understand that truncation is happening. But do you have any idea why it is so frequent? This is a direct blocker to adoption, and we need to solve it, mitigate it, or else give up on GES.

We have multiple engineers on this, so if you have any suggestions, we’ll do the legwork. But we are stuck right now.