Corrupted Chunk / Missing Epochs (3.0.5)

Hi,

We are running a 3-node EventStore 3.0.5 cluster (just upgraded from 3.0.3) on CentOS Linux release 7.1.1503 (Core).

Our subscriber service (.NET EventStore.Client NuGet version 3.0.2) started dropping subscriptions with the following error:

EventStore.ClientAPI.Exceptions.ServerErrorException: Unexpected error on server: Log record at actual pos 225498550 has too large length: 1701474162 bytes, while limit is 16777216 bytes. In chunk #0-0 (chunk-000000.000000).

The same error was logged on the master node of the EventStore cluster:

[21350,21,22:58:26.739] Error during processing ReadAllEventsForward request.

Log record at actual pos 225498550 has too large length: 1701474162 bytes, while limit is 16777216 bytes. In chunk #0-0 (chunk-000000.000000).

As an experiment we stopped the master node (a different node was elected as master), deleted its whole db, and started it again.

The node partially caught up, then got stuck in the PreReplica state with the following error in the logs:

[09851,11,08:01:32.877] ES TcpConnection closed [08:01:32.877: N10.2.40.41:39611, L10.2.40.42:1112, {871c122a-0be2-4687-88ad-0b302dbfd342}]:

Received bytes: 674, Sent bytes: 22

Send calls: 2, callbacks: 1

Receive calls: 2, callbacks: 1

Close reason: [Success] Exception during subscribing replica. Connection will be dropped. Error: Replica [10.2.40.41:1112,S:3867ccf5-c433-47d1-9b51-f3c65fd60497,119008548(0x717ED24),epochs:

E1402@118806688:{5fd61dc7-122f-4973-b0d2-13e372f8ede5}

E1401@118790980:{c8a2922b-9b26-4697-9701-4b04a4b4e35d}

E1400@118246569:{e865ca80-909f-44f6-9be8-bbdb084f5d1c}

E1399@118246341:{c015b468-ab19-4e87-8c60-49a6927b5ef9}

E1398@116214968:{6a22d244-6c58-4d7a-a717-6392c4af6975}

E1397@116214741:{8b10cebc-e132-4e79-af23-dcb5c8781af9}

E1396@113888496:{f0be5126-1dc3-4cf2-bbf8-754787d73b9a}

E1395@113888268:{ecec6fb9-a2ec-4d2e-8570-15dc60738002}

E1394@113888041:{5f1ff44f-aeda-4c6c-af15-3410b5a82ede}

E1393@113872303:{1bc40fd8-e4a2-494a-bca9-de67d1b025dc}

E1392@113872075:{89de292b-88dd-43fa-b316-3600f93d1594}

E1391@111656547:{d7554fd0-81a5-4cd5-9e03-4ce4019f8425}

E1390@111361894:{5f351896-9794-41ee-9365-8a0c5ef5fbc6}

E1389@111237794:{8e37d9ac-7faf-4512-b6a7-0d15a118a6c0}

E1388@111237567:{c100849e-1759-4aff-9753-606838e523cf}

E1387@110122473:{b91cd5f3-14f5-49e9-bf37-eff3c7b966db}

E1386@108659276:{a7e6c2a3-1782-49b4-b034-6b6863833493}

E1385@108657250:{54ce105b-452d-46ea-9528-683a1bbdd0b8}

E1384@108656889:{8c063edb-cd82-4911-be98-9f08c133fc28}

E1383@108640192:{10205f82-c311-433e-a088-808560e1265e}]

provided epochs which are not in EpochManager (possibly too old, known epochs:

E3695@264671158:{87f648e0-6367-4aeb-9b5d-83751d5a24a7}

E3694@264670798:{fec16e1d-b2de-4876-b277-1c4bceed559a}

E3693@264670437:{a44c1ce0-57bc-4a87-9225-44037ac526d1}

E3692@264669649:{89e8f4c8-5fd9-470c-8f82-f9e74b59a486}

E3691@264668994:{dc1177cd-96cf-4c72-8a9e-bf829051864e}

E3690@264668300:{e84a41e3-980d-44b4-aa12-5b7ffd58ac1c}

DB verification is enabled.

The nodes are hosted on VMs with virtual drives (/dev/vda); we are not sure about the write cache (the hdparm utility reports “HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device”).

The VMs have not been brought down, but they may have had some network issues.

This is not the first time this has happened on this cluster (we have three other clusters for three other environments, but those seem to be stable). Last time (on version 3.0.3) we wiped all three nodes.

At the current stage of development data loss is acceptable, but we want a stable environment.

Can anyone offer suggestions/ideas/hints as to why this is happening and how to fix it?

P.S.: we have backups of all three dbs and the logs, both before and after the wipe, should they be required.

So this looks like the node came up with a db of its own and tried to join the cluster. Basically the error message is saying "when the node came up we couldn't find any matching places in the log". Is it possible the node was brought up and created its own db, as opposed to replicating from another node?

Thanks for the reply, Greg.

I just realized that I should have formulated this as a two-part question.

  1. What could have caused the “Log record at actual pos 225498550 has too large length:…” error for the subscriber? Other threads seem to suggest a corrupted chunk, but the VMs were alive. Could it have been caused by the upgrade from EventStore 3.0.3 to 3.0.5?

  2. After deleting the database of the (then) master node, the node did not manage to sync. I’m assuming your reply was about this part. Is there a way to check for your scenario? At the time that node was brought back up, the other two nodes were alive and in “Master/Slave” states.

1. What could have caused the "Log record at actual pos 225498550 has too large length:..." error for the subscriber? Other threads seem to suggest a corrupted chunk, but the VMs were alive. Could it have been caused by the upgrade from EventStore 3.0.3 to 3.0.5?

This should be unrelated. The dev branch, on the other hand, has some changes that could be involved. There was also an issue quite some time ago (I want to say 3.0.1, without looking at commit logs) that could cause something like this.

2. After deleting the database of the (then) master node, the node did not manage to sync. I'm assuming your reply was about this part. Is there a way to check for your scenario? At the time that node was brought back up, the other two nodes were alive and in "Master/Slave" states.

A database should be able to come up and join, provided it's an empty database. If, however, it actually has a database (let's say by starting it up first and letting it create one), then it will get this exact error. You should, however, be able to bring up a node with an empty database and let it join the cluster.
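As an aside, one quick way to watch what state each node is in while an empty node re-joins is to poll the gossip endpoint of any live node. A minimal sketch (the address and port are taken from the logs above; treat the response fields as version-dependent and verify against your build):

```csharp
// Sketch: fetch cluster gossip over HTTP to watch node states while a
// wiped node re-joins. 2113 is the external HTTP port used in this
// thread's configuration; any live node will do.
using System;
using System.Net.Http;

class GossipCheck
{
    static void Main()
    {
        using (var http = new HttpClient())
        {
            http.DefaultRequestHeaders.Add("Accept", "application/json");
            // Each member entry in the response reports, among other things,
            // its state (Master/Slave/PreReplica/...) and its checkpoints;
            // a healthy rejoin ends with the wiped node reported as Slave
            // with its checkpoints converging toward the master's.
            var json = http.GetStringAsync("http://10.2.40.42:2113/gossip").Result;
            Console.WriteLine(json);
        }
    }
}
```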

We encountered similar errors when reading streams from an invalid position. Try reading from the beginning (e.g. 0/0) and see whether the data actually is corrupted.
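For example, something along these lines with the 3.x ClientAPI (a sketch; the endpoint and the default admin credentials are placeholders to adjust for your setup):

```csharp
// A minimal sketch (EventStore.ClientAPI 3.x): walk $all from the very
// beginning (position 0/0) and see whether any record fails to read.
using System;
using System.Net;
using EventStore.ClientAPI;
using EventStore.ClientAPI.SystemData;

class ReadAllFromStart
{
    static void Main()
    {
        var conn = EventStoreConnection.Create(
            new IPEndPoint(IPAddress.Parse("10.2.40.41"), 1113));
        conn.ConnectAsync().Wait();

        var credentials = new UserCredentials("admin", "changeit");
        var position = Position.Start; // commit/prepare position 0/0
        AllEventsSlice slice;
        do
        {
            // A genuinely corrupted chunk (or a bogus position) would
            // surface here as a ServerErrorException like the one above.
            slice = conn.ReadAllEventsForwardAsync(position, 500, false, credentials).Result;
            position = slice.NextPosition;
        } while (!slice.IsEndOfStream);

        Console.WriteLine("Reached the end of $all without read errors.");
        conn.Close();
    }
}
```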

Yes, this is the most common cause (a read from $all with a bad checkpoint passed in from client code).

Also, can you give the full error here?

"1. What could have caused the "Log record at actual pos 225498550 has
too large length:..." error for subscriber? Other threads seem to
suggest corrupted chunk, but VM's were alive. Could it have been
caused by an upgrade of EventStore 3.0.3 -> 3.0.5?"

@Raimondas

Thank you for the suggestion. We have just tried reading the stream from position 0/0. Everything seemed OK; the subscriber did not report any errors.

@Greg

A database should be able to come up and join, provided it’s an empty database. If, however, it actually has a database (let’s say by starting it up first and letting it create one), then it will get this exact error. You should, however, be able to bring up a node with an empty database and let it join the cluster.

We have tried wiping one node once again, but it did not help. This time we made sure that the other two nodes were alive. For some time it was catching up, but then it got stuck in the PreReplica state, requesting epochs that the other nodes did not have.

Also can you give the full error here?

Here is a fragment of the log file; I have highlighted a few interesting entries. I have also attached the full log file of that day. I am not sure whether this is still relevant, since the subscriber caught up successfully after resetting its positions to 0/0.

[21350,10,22:58:25.858] ELECTIONS: STARTING ELECTIONS.

[21350,10,22:58:25.859] ELECTIONS: (V=112) SHIFT TO LEADER ELECTION.

[21350,10,22:58:25.859] ELECTIONS: (V=112) VIEWCHANGE FROM [10.2.40.41:2112, {f83dfb62-b43e-4a6f-a54b-2c697e479166}].

[21350,10,22:58:25.871] ELECTIONS: (V=112) PREPARE FROM [10.2.40.42:2112, {4e1d20c4-000d-46eb-a4b6-db9ca9f805aa}].

[21350,07,22:58:26.723] SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 871ms. Q: 3/4.

[21350,10,22:58:26.723] ELECTIONS: (V=112) SHIFT TO REG_NONLEADER.

[21350,10,22:58:26.724] SLOW BUS MSG [MainBus]: Prepare - 852ms. Handler: ElectionsService.

[21350,10,22:58:26.724] SLOW QUEUE MSG [MainQueue]: Prepare - 853ms. Q: 0/5.

[21350,10,22:58:26.725] ELECTIONS: (V=113) VIEWCHANGE FROM [10.2.40.42:2112, {4e1d20c4-000d-46eb-a4b6-db9ca9f805aa}].

[21350,10,22:58:26.725] ELECTIONS: (V=113) SHIFT TO LEADER ELECTION.

[21350,10,22:58:26.725] ELECTIONS: (V=113) VIEWCHANGE FROM [10.2.40.41:2112, {f83dfb62-b43e-4a6f-a54b-2c697e479166}].

[21350,10,22:58:26.727] ELECTIONS: (V=113) MAJORITY OF VIEWCHANGE.

[21350,10,22:58:26.727] ELECTIONS: (V=113) SHIFT TO PREPARE PHASE.

[21350,10,22:58:26.728] ELECTIONS: (V=113) PREPARE_OK FROM 10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}.

[21350,10,22:58:26.734] ELECTIONS: (V=113) PREPARE_OK FROM 10.2.40.42:2112,{4e1d20c4-000d-46eb-a4b6-db9ca9f805aa}.

[21350,10,22:58:26.734] ELECTIONS: (V=113) SHIFT TO REG_LEADER.

[21350,10,22:58:26.734] ELECTIONS: (V=113) SENDING PROPOSAL CANDIDATE: 10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}, ME: 10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}.

[21350,10,22:58:26.735] ELECTIONS: (V=113) ACCEPT FROM [10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}] M=[10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}]).

[21350,21,22:58:26.739] Error during processing ReadAllEventsForward request.

Log record at actual pos 225498550 has too large length: 1701474162 bytes, while limit is 16777216 bytes. In chunk #0-0 (chunk-000000.000000).

[21350,23,22:58:26.739] Internal TCP connection accepted: [Normal, 10.2.40.48:32820, L10.2.40.41:1112, {4b4cbd74-d68b-4e13-930e-4df67c0ccd8d}].

[21350,10,22:58:26.740] ELECTIONS: (V=113) ACCEPT FROM [10.2.40.48:2112,{e73a2c0e-c470-4a68-aefb-b717af2070d7}] M=[10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}]).

[21350,21,22:58:26.743] SLOW QUEUE MSG [StorageReaderQueue #1]: ReadAllEventsForward - 889ms. Q: 0/0.

[21350,10,22:58:26.743] ELECTIONS: (V=113) DONE. ELECTED MASTER = 10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}. ME=10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}.

[21350,10,22:58:26.743] ELECTIONS: (V=113) ACCEPT FROM [10.2.40.42:2112,{4e1d20c4-000d-46eb-a4b6-db9ca9f805aa}] M=[10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}]).

[21350,10,22:58:26.744] ========== [10.2.40.41:2112] PRE-MASTER STATE, WAITING FOR CHASER TO CATCH UP…

[21350,10,22:58:27.106] ========== [10.2.40.41:2112] IS MASTER!!! SPARTA!!!

[21350,07,22:58:27.106] === Writing E2826@248452405:{3b38f4c9-3612-43ca-a69c-a23aeebd2ec2} (previous epoch at 248452177).

[21350,10,22:58:27.106] There are FEW MASTERS according to gossip, need to start elections. MASTER: [InstanceId: {f83dfb62-b43e-4a6f-a54b-2c697e479166}, InternalTcp: 10.2.40.41:1112, InternalSecureTcp: , ExternalTcp: 10.2.40.41:1113, ExternalSecureTcp: , InternalHttp: 10.2.40.41:2112, ExternalHttp: 10.2.40.41:2113].

GOSSIP:

VND {e73a2c0e-c470-4a68-aefb-b717af2070d7} [Master, 10.2.40.48:1112, n/a, 10.2.40.48:1113, n/a, 10.2.40.48:2112, 10.2.40.48:2113] 248451949/248482985/248467475/E2823@241474092:{535b2247-811d-4fc0-8b40-76d44f832c74} | 2015-06-25 22:58:26.508

VND {4e1d20c4-000d-46eb-a4b6-db9ca9f805aa} [PreReplica, 10.2.40.42:1112, n/a, 10.2.40.42:1113, n/a, 10.2.40.42:2112, 10.2.40.42:2113] 248436442/248451949/248451949/E2823@241474092:{535b2247-811d-4fc0-8b40-76d44f832c74} | 2015-06-25 22:58:26.861

VND {f83dfb62-b43e-4a6f-a54b-2c697e479166} [Master, 10.2.40.41:1112, n/a, 10.2.40.41:1113, n/a, 10.2.40.41:2112, 10.2.40.41:2113] 248436442/248452405/248452405/E2825@248452177:{9314b0b8-3ee1-46d1-bc09-a8dd77f6501b} | 2015-06-25 22:58:27.106

[21350,10,22:58:27.108] ELECTIONS: STARTING ELECTIONS.

[21350,10,22:58:27.108] ELECTIONS: (V=114) SHIFT TO LEADER ELECTION.

[21350,10,22:58:27.109] ELECTIONS: (V=114) VIEWCHANGE FROM [10.2.40.41:2112, {f83dfb62-b43e-4a6f-a54b-2c697e479166}].

[21350,10,22:58:27.115] ELECTIONS: (V=114) VIEWCHANGE FROM [10.2.40.42:2112, {4e1d20c4-000d-46eb-a4b6-db9ca9f805aa}].

[21350,10,22:58:27.115] ELECTIONS: (V=114) MAJORITY OF VIEWCHANGE.

[21350,10,22:58:27.119] ELECTIONS: (V=114) VIEWCHANGE FROM [10.2.40.48:2112, {e73a2c0e-c470-4a68-aefb-b717af2070d7}].

[21350,10,22:58:27.119] ELECTIONS: (V=114) PREPARE FROM [10.2.40.48:2112, {e73a2c0e-c470-4a68-aefb-b717af2070d7}].

[21350,10,22:58:27.119] ELECTIONS: (V=114) SHIFT TO REG_NONLEADER.

[21350,07,22:58:31.664] === Update Last Epoch E2826@248452405:{3b38f4c9-3612-43ca-a69c-a23aeebd2ec2} (previous epoch at 248452177).

[21350,07,22:58:31.665] SLOW QUEUE MSG [StorageWriterQueue]: BecomeMaster - 4558ms. Q: 0/4.

[21350,10,22:58:31.665] SLOW BUS MSG [MainBus]: Prepare - 4544ms. Handler: ElectionsService.

[21350,10,22:58:31.665] SLOW QUEUE MSG [MainQueue]: Prepare - 4545ms. Q: 0/45.

[21350,10,22:58:31.665] ELECTIONS: (V=114) PROPOSAL FROM [10.2.40.48:2112,{e73a2c0e-c470-4a68-aefb-b717af2070d7}] M=10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}. ME=10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}.

[21350,10,22:58:31.666] ELECTIONS: (V=114) ACCEPT FROM [10.2.40.48:2112,{e73a2c0e-c470-4a68-aefb-b717af2070d7}] M=[10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}]).

[21350,10,22:58:31.666] ELECTIONS: (V=114) ACCEPT FROM [10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}] M=[10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}]).

[21350,10,22:58:31.666] ELECTIONS: (V=114) DONE. ELECTED MASTER = 10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}. ME=10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}.

[21350,10,22:58:31.668] ELECTIONS: (V=114) ACCEPT FROM [10.2.40.42:2112,{4e1d20c4-000d-46eb-a4b6-db9ca9f805aa}] M=[10.2.40.41:2112,{f83dfb62-b43e-4a6f-a54b-2c697e479166}]).

[21350,10,22:58:31.668] SUBSCRIBE REQUEST from [10.2.40.42:1112,C:{bedf5f67-a1c3-4db2-b298-c207e417de07},S:{60555ee1-e79e-47c9-b2de-f467c63f39fd},248451949(0xECF136D),E2823@241474092:{535b2247-811d-4fc0-8b40-76d44f832c74}, E2822@233693566:{78e2432c-c6f5-4dbd-b0e2-b6213d99a9b2}, E2821@233672202:{db7c4656-c379-4851-a771-d629c895e862}, E2820@233671974:{23d90da0-9d29-4c44-8fcd-ac16f39a49cd}, E2819@233671281:{cbc94a1f-e6d7-46f1-b9d3-6b2c43a93b94}, E2818@233035997:{53a25250-9a95-48ac-8095-001e514dca0f}, E2817@232556112:{43480f3c-12df-4290-aedc-a72e673649d1}, E2816@232540412:{ef9363bd-f7c0-4e89-89e7-596f5a29a3ff}, E2815@232540184:{5b96f231-b6fe-4747-99f9-6eebe86f0f24}, E2814@232524443:{16b09ccf-0af2-48eb-be95-2d4db52e97bb}, E2813@232425374:{38f52dd9-258b-41f4-8b9a-8315a132b903}, E2812@232357132:{0eea0f4f-56cc-43ff-b4a1-9b378fa29597}, E2811@230965221:{0de90a8b-99fb-4baf-a74c-ecce9472d054}, E2810@230206919:{ecdbead5-8b38-4695-bbd5-ffd9baabc90f}, E2809@229881481:{b4fc212d-9948-4f05-98d3-1a92659c3660}, E2808@226095054:{2604677b-413d-4355-9f63-c452112bf27a}, E2807@226094826:{9b239fcf-2330-4f9a-b984-e680bc0db85f}, E2806@226017464:{d668618e-4f9a-4aa3-97c9-3c0448cee926}, E2805@225847482:{22a50963-a88d-4bc8-85ca-9c6a82dd57cc}, E2804@225569437:{b641e747-8021-4863-88c2-36c886c91544}]…

[21350,10,22:58:31.669] Subscribed replica [10.2.40.42:1112,S:60555ee1-e79e-47c9-b2de-f467c63f39fd] for data send at 248451949 (0xECF136D).

[21350,10,22:58:31.670] CLUSTER HAS CHANGED (gossip received from [10.2.40.42:2112])

Old:

VND {e73a2c0e-c470-4a68-aefb-b717af2070d7} [Master, 10.2.40.48:1112, n/a, 10.2.40.48:1113, n/a, 10.2.40.48:2112, 10.2.40.48:2113] 248451949/248482985/248467475/E2823@241474092:{535b2247-811d-4fc0-8b40-76d44f832c74} | 2015-06-25 22:58:26.508

VND {4e1d20c4-000d-46eb-a4b6-db9ca9f805aa} [PreReplica, 10.2.40.42:1112, n/a, 10.2.40.42:1113, n/a, 10.2.40.42:2112, 10.2.40.42:2113] 248436442/248451949/248451949/E2823@241474092:{535b2247-811d-4fc0-8b40-76d44f832c74} | 2015-06-25 22:58:26.861

VND {f83dfb62-b43e-4a6f-a54b-2c697e479166} [Master, 10.2.40.41:1112, n/a, 10.2.40.41:1113, n/a, 10.2.40.41:2112, 10.2.40.41:2113] 248436442/248452405/248452405/E2825@248452177:{9314b0b8-3ee1-46d1-bc09-a8dd77f6501b} | 2015-06-25 22:58:27.106

New:

VND {e73a2c0e-c470-4a68-aefb-b717af2070d7} [PreReplica, 10.2.40.48:1112, n/a, 10.2.40.48:1113, n/a, 10.2.40.48:2112, 10.2.40.48:2113] 248467475/248482985/248482985/E2823@241474092:{535b2247-811d-4fc0-8b40-76d44f832c74} | 2015-06-25 22:58:27.392

VND {4e1d20c4-000d-46eb-a4b6-db9ca9f805aa} [PreReplica, 10.2.40.42:1112, n/a, 10.2.40.42:1113, n/a, 10.2.40.42:2112, 10.2.40.42:2113] 248436442/248451949/248451949/E2823@241474092:{535b2247-811d-4fc0-8b40-76d44f832c74} | 2015-06-25 22:58:27.506

VND {f83dfb62-b43e-4a6f-a54b-2c697e479166} [Master, 10.2.40.41:1112, n/a, 10.2.40.41:1113, n/a, 10.2.40.41:2112, 10.2.40.41:2113] 248436442/248452633/248452633/E2826@248452405:{3b38f4c9-3612-43ca-a69c-a23aeebd2ec2} | 2015-06-25 22:58:31.669


I have also attached a full log file of the node.

eventstore.out-20150626.zip (372 KB)

Maybe I am missing something, but that looks like the nodes continued happily in a cluster?

For example, you end here:

VND {e73a2c0e-c470-4a68-aefb-b717af2070d7} <LIVE> [Master,
10.2.40.48:1112, n/a, 10.2.40.48:1113, n/a, 10.2.40.48:2112,
10.2.40.48:2113]
248451949/248482985/248467475/E2823@241474092:{535b2247-811d-4fc0-8b40-76d44f832c74}

2015-06-25 22:58:26.508

VND {4e1d20c4-000d-46eb-a4b6-db9ca9f805aa} <LIVE> [PreReplica,
10.2.40.42:1112, n/a, 10.2.40.42:1113, n/a, 10.2.40.42:2112,
10.2.40.42:2113]
248436442/248451949/248451949/E2823@241474092:{535b2247-811d-4fc0-8b40-76d44f832c74}

2015-06-25 22:58:26.861

VND {f83dfb62-b43e-4a6f-a54b-2c697e479166} <LIVE> [Master,
10.2.40.41:1112, n/a, 10.2.40.41:1113, n/a, 10.2.40.41:2112,
10.2.40.41:2113]
248436442/248452405/248452405/E2825@248452177:{9314b0b8-3ee1-46d1-bc09-a8dd77f6501b}

2015-06-25 22:58:27.106

New:
VND {e73a2c0e-c470-4a68-aefb-b717af2070d7} <LIVE> [PreReplica,
10.2.40.48:1112, n/a, 10.2.40.48:1113, n/a, 10.2.40.48:2112,
10.2.40.48:2113]
248467475/248482985/248482985/E2823@241474092:{535b2247-811d-4fc0-8b40-76d44f832c74}

2015-06-25 22:58:27.392

VND {4e1d20c4-000d-46eb-a4b6-db9ca9f805aa} <LIVE> [PreReplica,
10.2.40.42:1112, n/a, 10.2.40.42:1113, n/a, 10.2.40.42:2112,
10.2.40.42:2113]
248436442/248451949/248451949/E2823@241474092:{535b2247-811d-4fc0-8b40-76d44f832c74}

2015-06-25 22:58:27.506

VND {f83dfb62-b43e-4a6f-a54b-2c697e479166} <LIVE> [Master,
10.2.40.41:1112, n/a, 10.2.40.41:1113, n/a, 10.2.40.41:2112,
10.2.40.41:2113]
248436442/248452633/248452633/E2826@248452405:{3b38f4c9-3612-43ca-a69c-a23aeebd2ec2}

2015-06-25 22:58:31.669

These messages are all within about a second of each other. My guess is the master then promoted the PreReplicas to slaves?

New:
VND {c57bf060-2697-497a-b3e7-80d8961aeab0} <LIVE> [Slave,
10.2.40.48:1112, n/a, 10.2.40.48:1113, n/a, 10.2.40.48:2112,
10.2.40.48:2113]
263460081/263475649/263475649/E2860@262621472:{584be5bb-e211-401f-af94-e089f06f1b2d}

2015-06-26 07:16:29.075

VND {49b7540b-3037-4305-bbca-07d079b4483c} <DEAD> [Slave,
10.2.40.42:1112, n/a, 10.2.40.42:1113, n/a, 10.2.40.42:2112,
10.2.40.42:2113]
263460081/263475649/263475649/E2860@262621472:{584be5bb-e211-401f-af94-e089f06f1b2d}

2015-06-26 07:16:30.905

VND {f83dfb62-b43e-4a6f-a54b-2c697e479166} <LIVE> [Master,
10.2.40.41:1112, n/a, 10.2.40.41:1113, n/a, 10.2.40.41:2112,
10.2.40.41:2113]
263460081/263475649/263475649/E2860@262621472:{584be5bb-e211-401f-af94-e089f06f1b2d}

I am sorry, we must have miscommunicated. The error logs that I provided were a reply to:

Also can you give the full error here?
“1. What could have caused the “Log record at actual pos 225498550 has too large length:…” error for the subscriber? Other threads seem to suggest a corrupted chunk, but the VMs were alive. Could it have been caused by the upgrade from EventStore 3.0.3 to 3.0.5?”

We have reset our subscriber checkpoint, and the “Log record at actual pos 225498550 has too large length:…” error no longer appears. I think this part is no longer relevant.
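For reference, the reset amounts to starting the catch-up subscription from a null checkpoint so it replays $all from the beginning; roughly along these lines with the 3.x ClientAPI (a sketch with placeholder endpoint and handlers, not our exact code):

```csharp
// Sketch (EventStore.ClientAPI 3.x): a null lastCheckpoint makes
// SubscribeToAllFrom replay $all from the beginning, instead of resuming
// from a stored Position that may point into the middle of a record
// (which is what triggers the "too large length" read error).
using System;
using System.Net;
using EventStore.ClientAPI;

class ResubscribeFromStart
{
    static void Main()
    {
        var conn = EventStoreConnection.Create(
            new IPEndPoint(IPAddress.Parse("10.2.40.41"), 1113));
        conn.ConnectAsync().Wait();

        // Depending on your ACLs, reading $all may also require passing
        // UserCredentials to SubscribeToAllFrom.
        var sub = conn.SubscribeToAllFrom(
            null,   // lastCheckpoint: null == start from the beginning
            false,  // resolveLinkTos
            (s, evt) => Console.WriteLine(evt.OriginalPosition),
            s => Console.WriteLine("Live processing started."),
            (s, reason, ex) => Console.WriteLine("Dropped: " + reason + " " + ex));

        Console.ReadLine(); // run until a key is pressed
        sub.Stop();
        conn.Close();
    }
}
```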

However, the issue we are encountering now is that after wiping a single node it does not catch up with the other nodes. It catches up to epoch 1501 and then gets stuck in the PreReplica state. We made sure that the other nodes were alive during the wiping process.

10.2.40.41 M ----> Wipe ----> PreReplica

10.2.40.42 S ----> M ----> M

10.2.40.48 S ----> S ----> S


In the wiped node's logs we can see the following lines repeating every second. I assume this node is requesting these epochs from the master every second but fails to get them?

[08185,13,07:53:03.363] === Update Last Epoch E1501@132192803:{65ff754e-0e72-4e88-9305-3f57da403b57} (previous epoch at 132192576).

[08185,10,07:53:03.439] Subscribing at LogPosition: 132642553 (0x7E7F6F9) to MASTER [10.2.40.42:1112, {5afde9e9-e6e7-4249-8b34-1f49765ff917}] as replica with SubscriptionId: {591ccae3-806b-4f94-8e58-588049bd0e15}, ConnectionId: {66617732-6a90-46bb-9e0a-8f4ff779e8d4}, LocalEndPoint: [10.2.40.41:46669], Epochs:

E1501@132192803:{65ff754e-0e72-4e88-9305-3f57da403b57}

E1500@132192576:{ce0eabb4-d454-4141-be86-90f204840ddc}

E1499@132192348:{bb45449f-c8c1-4380-87d9-c0662d385713}

E1498@132192120:{a489f820-2e2e-4656-8c40-f0d5d57552b1}

E1497@132191893:{e2458136-35dd-4aef-94b5-bc8ef6a97efb}

E1496@132145160:{3864c62a-2aab-482f-b547-a3d1117a7637}

E1495@132128033:{c895a1a7-50f3-4655-932e-a0df5b96a3fa}

E1494@132127339:{cf97557b-b2cd-43f2-9c30-5fea498522e7}

E1493@132126818:{9e385852-ef57-4609-964b-61c5e1a0754e}

E1492@132126124:{f4206f54-db30-4e96-8f8d-1969df14c233}

E1491@132125896:{82c5b692-b1c1-4550-83bd-04b163e28022}

E1490@132125668:{4da3bfba-7b87-4b52-896d-623acf90abe3}

E1489@132109920:{099ce90a-2ef4-406b-98f2-5ef301d15498}

E1488@131024740:{5184d331-062e-43ae-9fb1-c3a3634aac9a}

E1487@131024513:{7fdee16c-45cf-4c3d-a56d-460f1319d08b}

E1486@131022313:{550dca76-2cdd-45e5-9f4d-e5dddee623e8}

E1485@131022085:{c19e6f69-a2c6-4a49-a815-b75dc1caa3ce}

E1484@131006371:{56dc0bb7-b516-492e-8859-de518ff9f1ca}

E1483@131006143:{b55fb8d1-25cd-48d0-8cd1-f0e5494dcc24}

E1482@131005915:{5f947130-fb30-4867-8310-46ce95a1ee6e}…


In the master node's log file we can see this error at exactly the same time.

[25494,11,07:53:03.443] SUBSCRIBE REQUEST from [10.2.40.41:1112,C:{7ca025d8-07bf-478c-bd34-2523f51b11bb},S:{591ccae3-806b-4f94-8e58-588049bd0e15},132642553(0x7E7F6F9),E1501@132192803:{65ff754e-0e72-4e88-9305-3f57da403b57}, E1500@132192576:{ce0eabb4-d454-4141-be86-90f204840ddc}, E1499@132192348:{bb45449f-c8c1-4380-87d9-c0662d385713}, E1498@132192120:{a489f820-2e2e-4656-8c40-f0d5d57552b1}, E1497@132191893:{e2458136-35dd-4aef-94b5-bc8ef6a97efb}, E1496@132145160:{3864c62a-2aab-482f-b547-a3d1117a7637}, E1495@132128033:{c895a1a7-50f3-4655-932e-a0df5b96a3fa}, E1494@132127339:{cf97557b-b2cd-43f2-9c30-5fea498522e7}, E1493@132126818:{9e385852-ef57-4609-964b-61c5e1a0754e}, E1492@132126124:{f4206f54-db30-4e96-8f8d-1969df14c233}, E1491@132125896:{82c5b692-b1c1-4550-83bd-04b163e28022}, E1490@132125668:{4da3bfba-7b87-4b52-896d-623acf90abe3}, E1489@132109920:{099ce90a-2ef4-406b-98f2-5ef301d15498}, E1488@131024740:{5184d331-062e-43ae-9fb1-c3a3634aac9a}, E1487@131024513:{7fdee16c-45cf-4c3d-a56d-460f1319d08b}, E1486@131022313:{550dca76-2cdd-45e5-9f4d-e5dddee623e8}, E1485@131022085:{c19e6f69-a2c6-4a49-a815-b75dc1caa3ce}, E1484@131006371:{56dc0bb7-b516-492e-8859-de518ff9f1ca}, E1483@131006143:{b55fb8d1-25cd-48d0-8cd1-f0e5494dcc24}, E1482@131005915:{5f947130-fb30-4867-8310-46ce95a1ee6e}]…

[25494,11,07:53:03.456] Replica [10.2.40.41:1112,S:591ccae3-806b-4f94-8e58-588049bd0e15,132642553(0x7E7F6F9),epochs:

E1501@132192803:{65ff754e-0e72-4e88-9305-3f57da403b57}

E1500@132192576:{ce0eabb4-d454-4141-be86-90f204840ddc}

E1499@132192348:{bb45449f-c8c1-4380-87d9-c0662d385713}

E1498@132192120:{a489f820-2e2e-4656-8c40-f0d5d57552b1}

E1497@132191893:{e2458136-35dd-4aef-94b5-bc8ef6a97efb}

E1496@132145160:{3864c62a-2aab-482f-b547-a3d1117a7637}

E1495@132128033:{c895a1a7-50f3-4655-932e-a0df5b96a3fa}

E1494@132127339:{cf97557b-b2cd-43f2-9c30-5fea498522e7}

E1493@132126818:{9e385852-ef57-4609-964b-61c5e1a0754e}

E1492@132126124:{f4206f54-db30-4e96-8f8d-1969df14c233}

E1491@132125896:{82c5b692-b1c1-4550-83bd-04b163e28022}

E1490@132125668:{4da3bfba-7b87-4b52-896d-623acf90abe3}

E1489@132109920:{099ce90a-2ef4-406b-98f2-5ef301d15498}

E1488@131024740:{5184d331-062e-43ae-9fb1-c3a3634aac9a}

E1487@131024513:{7fdee16c-45cf-4c3d-a56d-460f1319d08b}

E1486@131022313:{550dca76-2cdd-45e5-9f4d-e5dddee623e8}

E1485@131022085:{c19e6f69-a2c6-4a49-a815-b75dc1caa3ce}

E1484@131006371:{56dc0bb7-b516-492e-8859-de518ff9f1ca}

E1483@131006143:{b55fb8d1-25cd-48d0-8cd1-f0e5494dcc24}

E1482@131005915:{5f947130-fb30-4867-8310-46ce95a1ee6e}]

provided epochs which are not in EpochManager (possibly too old, known epochs:

E2864@263823368:{d5d6687d-ad6c-4f8a-a4f5-3d5faa3cdb71}

E2863@263776714:{ccc47045-976b-4dbc-afb5-28fcec97e4c7}

E2862@263497962:{90497c6c-c9af-4266-b23e-7ad823c36343}

E2861@263475649:{58db197e-d566-4513-98e0-3c206678af37}

E1869@175450882:{41a34ab7-0d23-4c58-9e40-3db23f4ac57e}

E1868@175433823:{410fa744-b294-4ede-8940-919161c7080e}

E1867@175433461:{69e7787d-74cb-4222-8f82-00c176bb9eb7}

E1866@175417756:{0a606619-90fd-4f49-82c5-1643270784f9}

E1865@175416769:{acb11dba-46b5-4b76-9408-aad17990cebd}).

Master LogPosition: 264133239 (0xFBE5A77). We do not support this case as of now.

CommonEpoch: E1501@132192803:{65ff754e-0e72-4e88-9305-3f57da403b57}, AfterCommonEpoch:


Does this mean that those epochs are lost? How do we bring the wiped node back to life?

I have attached log files for all machines. Any help is appreciated.

Paulius

20150630_0930.7z (1.43 MB)