Subscriptions receiving Invalid TCP frame received

We have had this system running for about a year and half, it is a 3 node cluster. We are receiving this in our client logs:

{ “date”: “2016-10-14 11:20:01.9779”, “level”: “Error”, “logger”: “EventStore.ClientAPI.IEventStoreConnection”, “machinename”: “RD0003FFA47995”, “processname”: “WebJobTransactionSubscription”, “processid”: “34516”, “threadid”: “3”, “message”: “TcpPackageConnection: [104.40.20.177:1113, L100.106.76.88:50043, {1b588cdf-a98c-4de0-b267-5c31ee51bd1a}]. Invalid TCP frame received.”, “exception”: “Package size is out of bounds: 69991366 (max: 67108864). This is likely an exceptionally large message (reading too many things) or there is a problem with the framing if working on a new client.” }

{ “date”: “2016-10-14 11:20:01.9779”, “level”: “Info”, “logger”: “EventStore.ClientAPI.IEventStoreConnection”, “machinename”: “RD0003FFA47995”, “processname”: “WebJobTransactionSubscription”, “processid”: “34516”, “threadid”: “3”, “message”: “ClientAPI TcpConnection closed [11:20:01.977: N104.40.20.177:1113, L100.106.76.88:50043, {1b588cdf-a98c-4de0-b267-5c31ee51bd1a}]:\nReceived bytes: 134858453, Sent bytes: 6319\nSend calls: 147, callbacks: 147\nReceive calls: 17625, callbacks: 17624\nClose reason: [Success] Invalid TCP frame received.\n” }

Our eventstore log:

[PID:01656:041 2016.10.14 11:20:01.704 TRACE InMemoryBus ] SLOW BUS MSG [Worker #1 Bus]: TcpSend - 598ms. Handler: TcpSendService.

[PID:01656:041 2016.10.14 11:20:01.704 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [Worker #1]: TcpSend - 598ms. Q: 0/3.

[PID:01656:033 2016.10.14 11:20:01.720 INFO TcpConnection ] ES TcpConnection closed [11:20:01.720: N40.83.183.69:61132, L10.208.34.4:1113, {1df3f5aa-60a0-4b00-9e33-4766781d3066}]:Received bytes: 6319, Sent bytes: 204669823

[PID:01656:033 2016.10.14 11:20:01.720 INFO TcpConnection ] ES TcpConnection closed [11:20:01.720: N40.83.183.69:61132, L10.208.34.4:1113, {1df3f5aa-60a0-4b00-9e33-4766781d3066}]:Send calls: 147, callbacks: 147

[PID:01656:033 2016.10.14 11:20:01.720 INFO TcpConnection ] ES TcpConnection closed [11:20:01.720: N40.83.183.69:61132, L10.208.34.4:1113, {1df3f5aa-60a0-4b00-9e33-4766781d3066}]:Receive calls: 148, callbacks: 148

[PID:01656:033 2016.10.14 11:20:01.720 INFO TcpConnection ] ES TcpConnection closed [11:20:01.720: N40.83.183.69:61132, L10.208.34.4:1113, {1df3f5aa-60a0-4b00-9e33-4766781d3066}]:Close reason: [Success] Socket closed

[PID:01656:033 2016.10.14 11:20:01.720 INFO TcpConnectionManager] Connection ‘external-normal’ [40.83.183.69:61132, {1df3f5aa-60a0-4b00-9e33-4766781d3066}] closed: Success.

[PID:01656:023 2016.10.14 11:20:01.720 DEBUG PersistentSubscripti] Lost connection from 40.83.183.69:61132

[PID:01656:038 2016.10.14 11:20:01.954 INFO TcpService ] External TCP connection accepted: [Normal, 40.83.183.69:61133, L10.208.34.4:1113, {a76905e1-f518-4d7d-a46c-1a04d79211ba}].

I also did find one of these in our node error log:

[PID:01656:005 2016.10.14 12:21:08.423 ERROR QueuedHandlerThreadP] —!!! VERY SLOW QUEUE MSG [StorageReaderQueue #3]: ReadStreamEventsForward - 30258ms. Q: 0/16.

[PID:01656:033 2016.10.14 12:21:21.895 ERROR QueuedHandlerThreadP] —!!! VERY SLOW QUEUE MSG [StorageReaderQueue #2]: ReadStreamEventsForward - 12436ms. Q: 0/10.

[PID:01656:006 2016.10.14 13:42:24.997 ERROR LengthPrefixMessageF] FRAMING ERROR! Data:

000000: 47 45 54 20 2F 20 48 54 54 50 2F 31 2E 31 0D 0A | GET / HTTP/1.1…

000016: 48 6F 73 74 3A 20 31 30 34 2E 34 30 2E 32 30 2E | Host: 104.40.20.

000032: 31 37 37 3A 31 31 31 33 0D 0A 43 6F 6E 6E 65 63 | 177:1113…Connec

000048: 74 69 6F 6E 3A 20 6B 65 65 70 2D 61 6C 69 76 65 | tion: keep-alive

000064: 0D 0A 55 70 67 72 61 64 65 2D 49 6E 73 65 63 75 | …Upgrade-Insecu

000080: 72 65 2D 52 65 71 75 65 73 74 73 3A 20 31 0D 0A | re-Requests: 1…

000096: 55 73 65 72 2D 41 67 65 6E 74 3A 20 4D 6F 7A 69 | User-Agent: Mozi

000112: 6C 6C 61 2F 35 2E 30 20 28 57 69 6E 64 6F 77 73 | lla/5.0 (Windows

000128: 20 4E 54 20 31 30 2E 30 3B 20 57 4F 57 36 34 29 | NT 10.0; WOW64)

000144: 20 41 70 70 6C 65 57 65 62 4B 69 74 2F 35 33 37 | AppleWebKit/537

000160: 2E 33 36 20 28 4B 48 54 4D 4C 2C 20 6C 69 6B 65 | .36 (KHTML, like

000176: 20 47 65 63 6B 6F 29 20 43 68 72 6F 6D 65 2F 35 | Gecko) Chrome/5

000192: 33 2E 30 2E 32 37 38 35 2E 31 34 33 20 53 61 66 | 3.0.2785.143 Saf

000208: 61 72 69 2F 35 33 37 2E 33 36 0D 0A 41 63 63 65 | ari/537.36…Acce

000224: 70 74 3A 20 74 65 78 74 2F 68 74 6D 6C 2C 61 70 | pt: text/html,ap

000240: 70 6C 69 63 61 74 69 6F 6E 2F 78 68 74 6D 6C 2B | plication/xhtml+

000256: 78 6D 6C 2C 61 70 70 6C 69 63 61 74 69 6F 6E 2F | xml,application/

000272: 78 6D 6C 3B 71 3D 30 2E 39 2C 69 6D 61 67 65 2F | xml;q=0.9,image/

000288: 77 65 62 70 2C 2A 2F 2A 3B 71 3D 30 2E 38 0D 0A | webp,/;q=0.8…

000304: 41 63 63 65 70 74 2D 45 6E 63 6F 64 69 6E 67 3A | Accept-Encoding:

000320: 20 67 7A 69 70 2C 20 64 65 66 6C 61 74 65 2C 20 | gzip, deflate,

000336: 73 64 63 68 0D 0A 41 63 63 65 70 74 2D 4C 61 6E | sdch…Accept-Lan

000352: 67 75 61 67 65 3A 20 65 6E 2D 55 53 2C 65 6E 3B | guage: en-US,en;

000368: 71 3D 30 2E 38 0D 0A 0D 0A | q=0.8…

[PID:01656:006 2016.10.14 13:42:25.075 ERROR TcpConnectionManager] Closing connection ‘external-normal’ [50.93.103.27:61916, L10.208.34.4:1113, {71b61115-60cd-4963-8528-c701a65ebe34}] due to error. Reason: Invalid TCP frame received. Error: Package size is out of bounds: 542393671 (max: 67108864)…

The Node that is running is reporting this for memory usage:

Image PID Hard Faults/sec Commit (KB) Working Set (KB) Shareable (KB) Private (KB)

EventStore.ClusterNode.exe 1656 0 986,996 780,208 10,740 769,468

This seems to hold up our system and no further processing can occur. We have tried rebuilding the indexes on the nodes, rebooting the machines, and restarting our client webjob. Looking for any advice as this is a production system that has been down a few days now.

Thanks!

we have also seen this on our webjob client

[10/14/2016 20:31:12 > ec938a: INFO] Executing: ‘SubscriptionMaster.MonitorEventStore’ - Reason: ‘Timer fired at 2016-10-14T20:31:11.7051381+00:00’

[10/14/2016 20:31:32 > ec938a: ERR ]

[10/14/2016 20:31:32 > ec938a: ERR ] Unhandled Exception: System.InvalidOperationException: Start may not be called on a promise-style task.

[10/14/2016 20:31:32 > ec938a: ERR ] at System.Threading.Tasks.Task.Start(TaskScheduler scheduler)

[10/14/2016 20:31:32 > ec938a: ERR ] at WebJobTransactionSubscription.SubscriptionMaster.EventStoreConnectionOnClosed(Object sender, ClientClosedEventArgs clientClosedEventArgs) in C:\a\1\s\src\WebJobTransactionSubscription\SubscriptionMaster.cs:line 88

[10/14/2016 20:31:32 > ec938a: ERR ] at System.EventHandler`1.Invoke(Object sender, TEventArgs e)

[10/14/2016 20:31:32 > ec938a: ERR ] at EventStore.ClientAPI.Internal.EventStoreConnectionLogicHandler.RaiseClosed(String reason)

[10/14/2016 20:31:32 > ec938a: ERR ] at EventStore.ClientAPI.Internal.EventStoreConnectionLogicHandler.CloseConnection(String reason, Exception exception)

[10/14/2016 20:31:32 > ec938a: ERR ] at EventStore.ClientAPI.Internal.EventStoreConnectionLogicHandler.TimerTick()

[10/14/2016 20:31:32 > ec938a: ERR ] at EventStore.ClientAPI.Internal.EventStoreConnectionLogicHandler.<.ctor>b__e(TimerTickMessage msg)

[10/14/2016 20:31:32 > ec938a: ERR ] at EventStore.ClientAPI.Internal.SimpleQueuedHandler.<>c__DisplayClass1`1.b__0(Message msg)

[10/14/2016 20:31:32 > ec938a: ERR ] at EventStore.ClientAPI.Internal.SimpleQueuedHandler.ProcessQueue(Object state)

[10/14/2016 20:31:32 > ec938a: ERR ] at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(Object state)

[10/14/2016 20:31:32 > ec938a: ERR ] at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)

[10/14/2016 20:31:32 > ec938a: ERR ] at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)

[10/14/2016 20:31:32 > ec938a: ERR ] at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()

[10/14/2016 20:31:32 > ec938a: ERR ] at System.Threading.ThreadPoolWorkQueue.Dispatch()

[10/14/2016 20:31:32 > ec938a: ERR ] at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()

[10/14/2016 20:31:32 > ec938a: SYS ERR ] Job failed due to exit code -532462766

[10/14/2016 20:31:33 > ec938a: SYS INFO] Process went down, waiting for 60 seconds

[PID:01656:006 2016.10.14 13:42:24.997 ERROR LengthPrefixMessageF]
FRAMING ERROR! Data:
000000: 47 45 54 20 2F 20 48 54 54 50 2F 31 2E 31 0D 0A | GET / HTTP/1.1..
000016: 48 6F 73 74 3A 20 31 30 34 2E 34 30 2E 32 30 2E | Host: 104.40.20.
000032: 31 37 37 3A 31 31 31 33 0D 0A 43 6F 6E 6E 65 63 | 177:1113..Connec
000048: 74 69 6F 6E 3A 20 6B 65 65 70 2D 61 6C 69 76 65 | tion: keep-alive
000064: 0D 0A 55 70 67 72 61 64 65 2D 49 6E 73 65 63 75 | ..Upgrade-Insecu
000080: 72 65 2D 52 65 71 75 65 73 74 73 3A 20 31 0D 0A | re-Requests: 1..
000096: 55 73 65 72 2D 41 67 65 6E 74 3A 20 4D 6F 7A 69 | User-Agent: Mozi
000112: 6C 6C 61 2F 35 2E 30 20 28 57 69 6E 64 6F 77 73 | lla/5.0 (Windows
000128: 20 4E 54 20 31 30 2E 30 3B 20 57 4F 57 36 34 29 | NT 10.0; WOW64)
000144: 20 41 70 70 6C 65 57 65 62 4B 69 74 2F 35 33 37 | AppleWebKit/537
000160: 2E 33 36 20 28 4B 48 54 4D 4C 2C 20 6C 69 6B 65 | .36 (KHTML, like
000176: 20 47 65 63 6B 6F 29 20 43 68 72 6F 6D 65 2F 35 | Gecko) Chrome/5
000192: 33 2E 30 2E 32 37 38 35 2E 31 34 33 20 53 61 66 | 3.0.2785.143 Saf
000208: 61 72 69 2F 35 33 37 2E 33 36 0D 0A 41 63 63 65 | ari/537.36..Acce
000224: 70 74 3A 20 74 65 78 74 2F 68 74 6D 6C 2C 61 70 | pt: text/html,ap
000240: 70 6C 69 63 61 74 69 6F 6E 2F 78 68 74 6D 6C 2B | plication/xhtml+
000256: 78 6D 6C 2C 61 70 70 6C 69 63 61 74 69 6F 6E 2F | xml,application/
000272: 78 6D 6C 3B 71 3D 30 2E 39 2C 69 6D 61 67 65 2F | xml;q=0.9,image/
000288: 77 65 62 70 2C 2A 2F 2A 3B 71 3D 30 2E 38 0D 0A | webp,*/*;q=0.8..
000304: 41 63 63 65 70 74 2D 45 6E 63 6F 64 69 6E 67 3A | Accept-Encoding:
000320: 20 67 7A 69 70 2C 20 64 65 66 6C 61 74 65 2C 20 | gzip, deflate,
000336: 73 64 63 68 0D 0A 41 63 63 65 70 74 2D 4C 61 6E | sdch..Accept-Lan
000352: 67 75 61 67 65 3A 20 65 6E 2D 55 53 2C 65 6E 3B | guage: en-US,en;
000368: 71 3D 30 2E 38 0D 0A 0D 0A | q=0.8....

This is someone sending a http request to the TCP port.

{ "date": "2016-10-14 11:20:01.9779", "level": "Error", "logger":
"EventStore.ClientAPI.IEventStoreConnection", "machinename":
"RD0003FFA47995", "processname": "WebJobTransactionSubscription",
"processid": "34516", "threadid": "3", "message":
"TcpPackageConnection: [104.40.20.177:1113, L100.106.76.88:50043,
{1b588cdf-a98c-4de0-b267-5c31ee51bd1a}]. Invalid TCP frame received.",
"exception": "Package size is out of bounds: 69991366 (max: 67108864).
This is likely an exceptionally large message (reading too many
things) or there is a problem with the framing if working on a new
client." }
{ "date": "2016-10-14 11:20:01.9779", "level": "Info", "logger":
"EventStore.ClientAPI.IEventStoreConnection", "machinename":
"RD0003FFA47995", "processname": "WebJobTransactionSubscription",
"processid": "34516", "threadid": "3", "message": "ClientAPI
TcpConnection closed [11:20:01.977: N104.40.20.177:1113,
L100.106.76.88:50043,
{1b588cdf-a98c-4de0-b267-5c31ee51bd1a}]:\nReceived bytes: 134858453,
Sent bytes: 6319\nSend calls: 147, callbacks: 147\nReceive calls:
17625, callbacks: 17624\nClose reason: [Success] Invalid TCP frame
received.\n" }

Is likely that you have a few large events in the same page that you
are reading. Try using a smaller page size on the subscription.

Is there a way to control the page size on the eventstore itself? We are trying to avoid a code change on the client side call.

No it is specified by the subscription, there could be 5 clients with
5 subscriptions all with different page sizes.