Strange persistent subscriptions behaviour leading to a retry loop.

Greetings,

We are experiencing a strange behaviour in our cluster (3 nodes).

Basically, we keep having these messages:

[PID:28278:036 2019.09.02 15:29:42.245 DEBUG PersistentSubscripti] Retrying message $et-order-created::oms-stats-enrichment $persistentsubscription-$et-order-created::oms-stats-enrichment-parked/

[PID:28278:013 2019.09.02 15:30:14.897 INFO PersistentSubscripti] Unable to park message $persistentsubscription-$et-order-created::oms-stats-enrichment-parked/50817 operation failed CommitTimeout retrying.

``

I stripped the log for those two lines because I got a lot of those two lines showing.

I also got these showing up, less frequently, compared to the lines I mentioned above:

[PID:28278:048 2019.09.02 15:39:57.920 DEBUG HttpSendService ] Dropping HTTP send message due to TTL being over. SendGossip To : 10.150.20.67:2113

[PID:28278:017 2019.09.02 15:39:57.920 DEBUG StorageReaderWorker ] Check Stream Access operation has expired for Stream: order-85d0838d-113e-4c91-9469-790bc23ca36d. Operation Expired at 9/2/2019 3:38:28 PM

[PID:28278:013 2019.09.02 15:39:57.920 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.921, peer’s time stamp: 2019-09-02 15:38:18.093.

[PID:28278:009 2019.09.02 15:39:57.921 DEBUG StorageReaderWorker ] Read All Stream Events Forward operation has expired for C:3854858709/P:0. Operation Expired at 9/2/2019 3:38:28 PM

[PID:28278:052 2019.09.02 15:39:57.921 DEBUG StorageReaderWorker ] Read Stream Events Forward operation has expired for Stream: $ce-order, From Event Number: 1739601, Max Count: 500. Operation Expired at 9/2/2019 3:38:28 PM

[PID:28278:013 2019.09.02 15:39:57.921 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.921, peer’s time stamp: 2019-09-02 15:38:19.103.

[PID:28278:013 2019.09.02 15:39:57.921 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.922, peer’s time stamp: 2019-09-02 15:38:20.104.

[PID:28278:013 2019.09.02 15:39:57.921 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.922, peer’s time stamp: 2019-09-02 15:38:20.273.

[PID:28278:013 2019.09.02 15:39:57.922 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.922, peer’s time stamp: 2019-09-02 15:38:21.105.

[PID:28278:013 2019.09.02 15:39:57.922 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.923, peer’s time stamp: 2019-09-02 15:38:22.106.

[PID:28278:013 2019.09.02 15:39:57.922 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.923, peer’s time stamp: 2019-09-02 15:38:23.106.

[PID:28278:013 2019.09.02 15:39:57.923 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.923, peer’s time stamp: 2019-09-02 15:38:25.298.

[PID:28278:013 2019.09.02 15:39:57.923 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.924, peer’s time stamp: 2019-09-02 15:38:26.298.

[PID:28278:013 2019.09.02 15:39:57.923 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.924, peer’s time stamp: 2019-09-02 15:38:27.123.

[PID:28278:051 2019.09.02 15:39:57.923 DEBUG StorageReaderWorker ] Check Stream Access operation has expired for Stream: $projections-$master. Operation Expired at 9/2/2019 3:38:36 PM

[PID:28278:013 2019.09.02 15:39:57.923 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.924, peer’s time stamp: 2019-09-02 15:38:27.298.

[PID:28278:051 2019.09.02 15:39:57.923 DEBUG StorageReaderWorker ] Check Stream Access operation has expired for Stream: $et-order-created. Operation Expired at 9/2/2019 3:38:36 PM

[PID:28278:051 2019.09.02 15:39:57.924 DEBUG StorageReaderWorker ] Check Stream Access operation has expired for Stream: $streams. Operation Expired at 9/2/2019 3:38:37 PM

[PID:28278:013 2019.09.02 15:39:57.924 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.924, peer’s time stamp: 2019-09-02 15:38:28.298.

[PID:28278:044 2019.09.02 15:39:57.924 DEBUG StorageReaderWorker ] Read Stream Events Forward operation has expired for Stream: $projections-$master, From Event Number: 818360, Max Count: 10. Operation Expired at 9/2/2019 3:38:38 PM

[PID:28278:051 2019.09.02 15:39:57.924 DEBUG StorageReaderWorker ] Read All Stream Events Forward operation has expired for C:3854858709/P:0. Operation Expired at 9/2/2019 3:38:38 PM

[PID:28278:013 2019.09.02 15:39:57.924 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.925, peer’s time stamp: 2019-09-02 15:38:29.299.

[PID:28278:013 2019.09.02 15:39:57.924 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.925, peer’s time stamp: 2019-09-02 15:38:30.302.

[PID:28278:013 2019.09.02 15:39:57.924 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.925, peer’s time stamp: 2019-09-02 15:38:31.312.

[PID:28278:013 2019.09.02 15:39:57.925 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.925, peer’s time stamp: 2019-09-02 15:38:32.313.

[PID:28278:013 2019.09.02 15:39:57.925 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.926, peer’s time stamp: 2019-09-02 15:38:34.161.

[PID:28278:054 2019.09.02 15:39:57.925 DEBUG StorageReaderWorker ] Read Stream Events Forward operation has expired for Stream: $ce-order, From Event Number: 1739601, Max Count: 500. Operation Expired at 9/2/2019 3:38:43 PM

[PID:28278:013 2019.09.02 15:39:57.925 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.926, peer’s time stamp: 2019-09-02 15:38:35.328.

[PID:28278:013 2019.09.02 15:39:57.925 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.926, peer’s time stamp: 2019-09-02 15:38:36.330.

[PID:28278:013 2019.09.02 15:39:57.943 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.943, peer’s time stamp: 2019-09-02 15:38:37.184.

[PID:28278:013 2019.09.02 15:39:57.943 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.943, peer’s time stamp: 2019-09-02 15:38:37.340.

[PID:28278:013 2019.09.02 15:39:57.943 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.943, peer’s time stamp: 2019-09-02 15:38:38.189.

[PID:28278:024 2019.09.02 15:39:57.943 DEBUG StorageReaderWorker ] Read All Stream Events Forward operation has expired for C:3854858709/P:0. Operation Expired at 9/2/2019 3:38:48 PM

[PID:28278:013 2019.09.02 15:39:57.943 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.944, peer’s time stamp: 2019-09-02 15:38:38.341.

[PID:28278:050 2019.09.02 15:39:57.943 DEBUG StorageReaderWorker ] Read Stream Events Forward operation has expired for Stream: $projections-$master, From Event Number: 818360, Max Count: 10. Operation Expired at 9/2/2019 3:38:48 PM

[PID:28278:013 2019.09.02 15:39:57.943 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.944, peer’s time stamp: 2019-09-02 15:38:42.199.

[PID:28278:013 2019.09.02 15:39:57.943 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.944, peer’s time stamp: 2019-09-02 15:38:42.368.

[PID:28278:013 2019.09.02 15:39:57.944 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.944, peer’s time stamp: 2019-09-02 15:38:43.209.

[PID:28278:013 2019.09.02 15:39:57.944 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.945, peer’s time stamp: 2019-09-02 15:38:43.377.

[PID:28278:047 2019.09.02 15:39:57.944 DEBUG StorageReaderWorker ] Read Stream Events Forward operation has expired for Stream: $ce-order, From Event Number: 1739601, Max Count: 500. Operation Expired at 9/2/2019 3:38:53 PM

[PID:28278:013 2019.09.02 15:39:57.944 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.945, peer’s time stamp: 2019-09-02 15:38:44.210.

[PID:28278:013 2019.09.02 15:39:57.944 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.945, peer’s time stamp: 2019-09-02 15:38:44.385.

[PID:28278:007 2019.09.02 15:39:57.946 DEBUG PersistentSubscripti] Lost connection from 10.150.80.33:47022

[PID:28278:055 2019.09.02 15:39:57.950 DEBUG StorageReaderWorker ] Read Stream Events Forward operation has expired for Stream: $projections-$master, From Event Number: 818360, Max Count: 10. Operation Expired at 9/2/2019 3:38:58 PM

[PID:28278:013 2019.09.02 15:39:57.950 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.950, peer’s time stamp: 2019-09-02 15:38:48.227.

[PID:28278:038 2019.09.02 15:39:57.950 DEBUG StorageReaderWorker ] Read All Stream Events Forward operation has expired for C:3854858709/P:0. Operation Expired at 9/2/2019 3:38:58 PM

[PID:28278:041 2019.09.02 15:39:57.951 DEBUG StorageReaderWorker ] Read Stream Events Backward operation has expired for Stream: import-cursor, From Event Number: -1, Max Count: 1. Operation Expired at 9/2/2019 3:38:58 PM

[PID:28278:013 2019.09.02 15:39:57.951 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.951, peer’s time stamp: 2019-09-02 15:38:51.251.

[PID:28278:013 2019.09.02 15:39:57.951 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.951, peer’s time stamp: 2019-09-02 15:38:52.261.

[PID:28278:013 2019.09.02 15:39:57.951 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.952, peer’s time stamp: 2019-09-02 15:38:53.266.

[PID:28278:032 2019.09.02 15:39:57.952 DEBUG StorageReaderWorker ] Read Stream Events Forward operation has expired for Stream: $ce-order, From Event Number: 1739601, Max Count: 500. Operation Expired at 9/2/2019 3:39:03 PM

[PID:28278:013 2019.09.02 15:39:57.952 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.952, peer’s time stamp: 2019-09-02 15:38:54.277.

[PID:28278:013 2019.09.02 15:39:57.954 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.954, peer’s time stamp: 2019-09-02 15:38:54.426.

[PID:28278:013 2019.09.02 15:39:57.954 ERROR GossipServiceBase ] Time difference between us and [10.150.20.64:2113] is too great! UTC now: 2019-09-02 15:39:57.954, peer’s time stamp: 2019-09-02 15:38:56.287.

[PID:28278:013 2019.09.02 15:39:57.954 ERROR GossipServiceBase ] Time difference between us and [10.150.20.67:2113] is too great! UTC now: 2019-09-02 15:39:57.954, peer’s time stamp: 2019-09-02 15:38:57.430.

[PID:28278:035 2019.09.02 15:39:57.954 DEBUG StorageReaderWorker ] Read Stream Events Forward operation has expired for Stream: $projections-$master, From Event Number: 818360, Max Count: 10. Operation Expired at 9/2/2019 3:39:08 PM

[PID:28278:026 2019.09.02 15:39:57.954 DEBUG StorageReaderWorker ] Read All Stream Events Forward operation has expired for C:3854858709/P:0. Operation Expired at 9/2/2019 3:39:08 PM

[PID:28278:047 2019.09.02 15:39:57.956 DEBUG StorageReaderWorker ] Read Stream Events Backward operation has expired for Stream: import-cursor, From Event Number: -1, Max Count: 1. Operation Expired at 9/2/2019 3:39:17 PM

[PID:28278:027 2019.09.02 15:39:57.956 DEBUG StorageReaderWorker ] Read Stream Events Forward operation has expired for Stream: $projections-$master, From Event Number: 818360, Max Count: 10. Operation Expired at 9/2/2019 3:39:18 PM

[PID:28278:027 2019.09.02 15:39:57.956 DEBUG StorageReaderWorker ] Read Stream Events Forward operation has expired for Stream: $persistentsubscription-$et-order-created::steering-wheel-parked, From Event Number: 0, Max Count: 500. O

peration Expired at 9/2/2019 3:39:20 PM

``

I’m more concerned about those “Time difference between us [] and [] is too great!” error because I never got those for the last five years I use the database.

“Operation has expired from Stream” messages seem to be a byproduct of the issue we are facing. Something funny though, only read operations are failing (persistent subscriptions included), the database still writes events as those come (very low workload ~30events/sec, < 500KB of payload each).

What made me think something wrong was happening is when going in the persistent subscription page, it will either take about a minute to show the subscription list or failed with a 406/408 HTTP code error.

When logging into one node instance, there is always one core always at 100%. We believe the trigger of this situation happened when we tapped “Replay parked messages” button (~3 hours ago).

It is not a production application (still in beta stage), so this is not a blocking situation for us. The database has very few events, just under 1M (< 500KB of payload)

Env:

Cluster of 3 nodes (using gossip seed only).

Each node env:

  • 4 cores CPU and 8GB RAM

  • Ubuntu 16.04 LTS

  • EventStore 4.1.0.0 (with those 4.1 patches applied)

  • 250 GB hard drive (we have checked those aren’t full)

Thanks for your time.

ps: I can provide logs but can’t share the database.

After a day of investigation, it turned out we were also using --cluster-dns settings too. Our internal DNS had resolution issue from time to time depending on which subnet our machines belong to. Disabling the DNS discovery solved our problem in this case.