eventstore cluster does not work under mono

I noticed I had a duplication of port settings in the first command so I removed --tcp-port=1113 --http-port=2113, with the same result though…

now:

mono-sgen EventStore.ClusterNode.exe --log=/var/log/eventstore --int-ip=0.0.0.0 --ext-ip=0.0.0.0 --int-tcp-port=1112 --ext-tcp-port=1113 --int-http-port=2112 --ext-http-port=2113 --cluster-size=3 --use-dns-discovery- --db=/data/eventstore --http-prefix=http://*:2113/ --gossip-seed=172.17.8.101:2113 --gossip-seed=172.17.8.102:2113 --gossip-seed=172.17.8.103:2113

log:

Jul 24 15:54:52 core1 bash[15516]: ES VERSION: 3.0.0.0 (master/30f7fa64d73ba65028e5a6ca720639985cac1458, Tue, 18 Mar 2014 20:03:41 +0000)

Jul 24 15:54:52 core1 bash[15516]: OS: Unknown (Unix 3.15.2.0)

Jul 24 15:54:52 core1 bash[15516]: RUNTIME: 3.2.8 (Debian 3.2.8+dfsg-7) (64-bit)

Jul 24 15:54:52 core1 bash[15516]: GC: 2 GENERATIONS

Jul 24 15:54:52 core1 bash[15516]: LOGS: /var/log/eventstore

Jul 24 15:54:52 core1 bash[15516]: SHOW HELP: False ()

Jul 24 15:54:52 core1 bash[15516]: SHOW VERSION: False ()

Jul 24 15:54:52 core1 bash[15516]: LOGS DIR: /var/log/eventstore (–logsdir from command line)

Jul 24 15:54:52 core1 bash[15516]: CONFIGS: ()

Jul 24 15:54:52 core1 bash[15516]: DEFINES: ()

Jul 24 15:54:52 core1 bash[15516]: INTERNAL IP: 0.0.0.0 (–internal-ip from command line)

Jul 24 15:54:52 core1 bash[15516]: EXTERNAL IP: 0.0.0.0 (–external-ip from command line)

Jul 24 15:54:52 core1 bash[15516]: INTERNAL HTTP PORT: 2112 (–internal-http-port from command line)

Jul 24 15:54:52 core1 bash[15516]: EXTERNAL HTTP PORT: 2113 (–external-http-port from command line)

Jul 24 15:54:52 core1 bash[15516]: INTERNAL TCP PORT: 1112 (–internal-tcp-port from command line)

Jul 24 15:54:52 core1 bash[15516]: INTERNAL SECURE TCP PORT: 0 ()

Jul 24 15:54:52 core1 bash[15516]: EXTERNAL TCP PORT: 1113 (–external-tcp-port from command line)

Jul 24 15:54:52 core1 bash[15516]: EXTERNAL SECURE TCP PORT: 0 ()

Jul 24 15:54:52 core1 bash[15516]: FORCE: False ()

Jul 24 15:54:52 core1 bash[15516]: CLUSTER SIZE: 3 (–cluster-size from command line)

Jul 24 15:54:52 core1 bash[15516]: MIN FLUSH DELAY MS: 2 ()

Jul 24 15:54:52 core1 bash[15516]: NODE PRIORITY: 0 ()

Jul 24 15:54:52 core1 bash[15516]: COMMIT COUNT: 2 ()

Jul 24 15:54:52 core1 bash[15516]: PREPARE COUNT: 2 ()

Jul 24 15:54:52 core1 bash[15516]: MAX MEM TABLE SIZE: 1000000 ()

Jul 24 15:54:52 core1 bash[15516]: DISCOVER VIA DNS: False (–use-dns-discovery from command line)

Jul 24 15:54:52 core1 bash[15516]: CLUSTER DNS: fake.dns ()

Jul 24 15:54:52 core1 bash[15516]: CLUSTER GOSSIP PORT: 30777 ()

Jul 24 15:54:52 core1 bash[15516]: GOSSIP SEEDS: 172.17.8.101:2113, 172.17.8.102:2113, 172.17.8.103:2113 (–gossip-seed from command line)

Jul 24 15:54:52 core1 bash[15516]: STATS PERIOD SEC: 30 ()

Jul 24 15:54:52 core1 bash[15516]: CACHED CHUNKS: -1 ()

Jul 24 15:54:52 core1 bash[15516]: CHUNKS CACHE SIZE: 536871424 ()

Jul 24 15:54:52 core1 bash[15516]: DB PATH: /data/eventstore (–db from command line)

Jul 24 15:54:52 core1 bash[15516]: IN MEM DB: False ()

Jul 24 15:54:52 core1 bash[15516]: SKIP DB VERIFY: False ()

Jul 24 15:54:52 core1 bash[15516]: RUN PROJECTIONS: System ()

Jul 24 15:54:52 core1 bash[15516]: PROJECTION THREADS: 3 ()

Jul 24 15:54:52 core1 bash[15516]: WORKER THREADS: 5 ()

Jul 24 15:54:52 core1 bash[15516]: HTTP PREFIXES: http://*:2113/ (–http-prefix from command line)

Jul 24 15:54:52 core1 bash[15516]: ENABLE TRUSTED AUTH: False ()

Jul 24 15:54:52 core1 bash[15516]: CERTIFICATE STORE: ()

Jul 24 15:54:52 core1 bash[15516]: CERTIFICATE NAME: ()

Jul 24 15:54:52 core1 bash[15516]: CERTIFICATE FILE: ()

Jul 24 15:54:52 core1 bash[15516]: CERTIFICATE PASSWORD: ()

Jul 24 15:54:52 core1 bash[15516]: USE INTERNAL SSL: False ()

Jul 24 15:54:52 core1 bash[15516]: SSL TARGET HOST: n/a ()

Jul 24 15:54:52 core1 bash[15516]: SSL VALIDATE SERVER: True ()

Jul 24 15:54:52 core1 bash[15516]: AUTHENTICATION TYPE: internal ()

Jul 24 15:54:52 core1 bash[15516]: AUTHENTICATION CONFIG FILE: ()

Jul 24 15:54:52 core1 bash[15516]: PREPARE TIMEOUT MS: 2000 ()

Jul 24 15:54:52 core1 bash[15516]: COMMIT TIMEOUT MS: 2000 ()

Jul 24 15:54:52 core1 bash[15516]: DISABLE SCAVENGE MERGING: False ()

Jul 24 15:54:52 core1 bash[15516]: GOSSIP ON EXT: True ()

Jul 24 15:54:52 core1 bash[15516]: STATS ON EXT: True ()

Jul 24 15:54:52 core1 bash[15516]: ADMIN ON EXT: True ()

Jul 24 15:54:52 core1 bash[15516]: [00001,01,15:54:52.814] Quorum size set to 2

Jul 24 15:54:52 core1 bash[15516]: [00001,01,15:54:52.820] Can’t find plugins path: /opt/eventstore/plugins

Jul 24 15:54:52 core1 bash[15516]: [00001,01,15:54:52.895]

Jul 24 15:54:52 core1 bash[15516]: INSTANCE ID: bab429e2-3959-416e-bdc3-db25fa419310

Jul 24 15:54:52 core1 bash[15516]: DATABASE: /data/eventstore

Jul 24 15:54:52 core1 bash[15516]: WRITER CHECKPOINT: 0 (0x0)

Jul 24 15:54:52 core1 bash[15516]: CHASER CHECKPOINT: 0 (0x0)

Jul 24 15:54:52 core1 bash[15516]: EPOCH CHECKPOINT: -1 (0xFFFFFFFFFFFFFFFF)

Jul 24 15:54:52 core1 bash[15516]: TRUNCATE CHECKPOINT: -1 (0xFFFFFFFFFFFFFFFF)

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.034] MessageHierarchy initialization took 00:00:00.1110701.

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.090] CACHED TFChunk #0-0 (chunk-000000.000000) in 00:00:00.0018395.

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.188] Starting MiniWeb for /web/es/js/projections ==> /opt/eventstore/singlenode-web/js/projections

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.188] Starting MiniWeb for /web/es/js/projections/v8/Prelude ==> /opt/eventstore/Prelude

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.188] Starting MiniWeb for /web/es/js/projections/resources ==> /opt/eventstore/web-resources/js

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.189] Binding MiniWeb to /web/es/js/projections/{*remaining_path}

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.189] Binding MiniWeb to /web/es/js/projections/v8/Prelude/{*remaining_path}

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.190] Binding MiniWeb to /web/es/js/projections/resources/{*remaining_path}

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.190] Binding MiniWeb to /web/es/js/projections/{*remaining_path}

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.190] Binding MiniWeb to /web/es/js/projections/v8/Prelude/{*remaining_path}

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.190] Binding MiniWeb to /web/es/js/projections/resources/{*remaining_path}

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.195] Starting MiniWeb for /web ==> /opt/eventstore/clusternode-web

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.195] Starting MiniWeb for /web/es ==> /opt/eventstore/es-common-web

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.195] Binding MiniWeb to /web/{*remaining_path}

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.195] Binding MiniWeb to /web/es/{*remaining_path}

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.196] Starting MiniWeb for /web/users ==> /opt/eventstore/Users/web

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.196] Binding MiniWeb to /web/users/{*remaining_path}

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.196] Starting MiniWeb for /web ==> /opt/eventstore/clusternode-web

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.196] Starting MiniWeb for /web/es ==> /opt/eventstore/es-common-web

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.196] Binding MiniWeb to /web/{*remaining_path}

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.196] Binding MiniWeb to /web/es/{*remaining_path}

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.196] Starting MiniWeb for /web/users ==> /opt/eventstore/Users/web

Jul 24 15:54:53 core1 bash[15516]: [00001,01,15:54:53.196] Binding MiniWeb to /web/users/{*remaining_path}

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.213] ========== [0.0.0.0:2112] SYSTEM INIT…

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.266] Starting Normal TCP listening on TCP endpoint: 0.0.0.0:1113.

Jul 24 15:54:53 core1 bash[15516]: [00001,13,15:54:53.275] TableIndex initialization…

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.278] Starting Normal TCP listening on TCP endpoint: 0.0.0.0:1112.

Jul 24 15:54:53 core1 bash[15516]: [00001,13,15:54:53.301] ReadIndex building…

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.303] Starting HTTP server on [http://*:2113/]…

Jul 24 15:54:53 core1 bash[15516]: [00001,13,15:54:53.304] ReadIndex rebuilding done: total processed 0 records, time elapsed: 00:00:00.0027390.

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.306] HTTP server is up and listening on [http://*:2113/]

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.308] Starting HTTP server on [http://0.0.0.0:2112/]…

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.309] HTTP server is up and listening on [http://0.0.0.0:2112/]

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.327] SLOW QUEUE MSG [MainQueue]: SystemInit - 113ms. Q: 0/7.

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.327] ========== [0.0.0.0:2112] Service ‘StorageWriter’ initialized.

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.327] ========== [0.0.0.0:2112] Service ‘StorageReader’ initialized.

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.364] CLUSTER HAS CHANGED

Jul 24 15:54:53 core1 bash[15516]: Old:

Jul 24 15:54:53 core1 bash[15516]: VND {bab429e2-3959-416e-bdc3-db25fa419310} [Unknown, 0.0.0.0:1112, n/a, 0.0.0.0:1113, n/a, 0.0.0.0:2112, 0.0.0.0:2113] -1/0/0/E-1@-1:{00000000-0000-0000-0000-000000000000} | 2014-07-24 15:54:53.309

Jul 24 15:54:53 core1 bash[15516]: New:

Jul 24 15:54:53 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.103:2113, 172.17.8.103:2113] | 2014-07-24 15:54:53.330

Jul 24 15:54:53 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.102:2113, 172.17.8.102:2113] | 2014-07-24 15:54:53.330

Jul 24 15:54:53 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.101:2113, 172.17.8.101:2113] | 2014-07-24 15:54:53.330

Jul 24 15:54:53 core1 bash[15516]: VND {bab429e2-3959-416e-bdc3-db25fa419310} [Unknown, 0.0.0.0:1112, n/a, 0.0.0.0:1113, n/a, 0.0.0.0:2112, 0.0.0.0:2113] -1/0/0/E-1@-1:{00000000-0000-0000-0000-000000000000} | 2014-07-24 15:54:53.309

Jul 24 15:54:53 core1 bash[15516]: --------------------------------------------------------------------------------

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.366] ========== [0.0.0.0:2112] Service ‘StorageChaser’ initialized.

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.375] ========== [0.0.0.0:2112] SYSTEM START…

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.383] ========== [0.0.0.0:2112] IS UNKNOWN!!! WHOA!!!

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.431] ELECTIONS: STARTING ELECTIONS.

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.431] ELECTIONS: (V=0) SHIFT TO LEADER ELECTION.

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.434] ELECTIONS: (V=0) VIEWCHANGE FROM [0.0.0.0:2112, {bab429e2-3959-416e-bdc3-db25fa419310}].

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.458] SLOW BUS MSG [MainBus]: StartElections - 48ms. Handler: ElectionsService.

Jul 24 15:54:53 core1 bash[15516]: [00001,10,15:54:53.460] SLOW QUEUE MSG [MainQueue]: StartElections - 72ms. Q: 0/3.

Jul 24 15:54:54 core1 bash[15516]: [00001,10,15:54:54.061] Looks like node [172.17.8.102:2113] is DEAD (Gossip send failed).

Jul 24 15:54:54 core1 bash[15516]: [00001,10,15:54:54.061] CLUSTER HAS CHANGED (gossip send failed to [172.17.8.102:2113])

Jul 24 15:54:54 core1 bash[15516]: Old:

Jul 24 15:54:54 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.103:2113, 172.17.8.103:2113] | 2014-07-24 15:54:53.330

Jul 24 15:54:54 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.102:2113, 172.17.8.102:2113] | 2014-07-24 15:54:53.330

Jul 24 15:54:54 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.101:2113, 172.17.8.101:2113] | 2014-07-24 15:54:53.330

Jul 24 15:54:54 core1 bash[15516]: VND {bab429e2-3959-416e-bdc3-db25fa419310} [Unknown, 0.0.0.0:1112, n/a, 0.0.0.0:1113, n/a, 0.0.0.0:2112, 0.0.0.0:2113] -1/0/0/E-1@-1:{00000000-0000-0000-0000-000000000000} | 2014-07-24 15:54:53.981

Jul 24 15:54:54 core1 bash[15516]: New:

Jul 24 15:54:54 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.103:2113, 172.17.8.103:2113] | 2014-07-24 15:54:53.330

Jul 24 15:54:54 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.102:2113, 172.17.8.102:2113] | 2014-07-24 15:54:54.061

Jul 24 15:54:54 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.101:2113, 172.17.8.101:2113] | 2014-07-24 15:54:53.330

Jul 24 15:54:54 core1 bash[15516]: VND {bab429e2-3959-416e-bdc3-db25fa419310} [Unknown, 0.0.0.0:1112, n/a, 0.0.0.0:1113, n/a, 0.0.0.0:2112, 0.0.0.0:2113] -1/0/0/E-1@-1:{00000000-0000-0000-0000-000000000000} | 2014-07-24 15:54:53.981

Jul 24 15:54:54 core1 bash[15516]: --------------------------------------------------------------------------------

Jul 24 15:54:55 core1 bash[15516]: [00001,10,15:54:54.346] Looks like node [172.17.8.103:2113] is DEAD (Gossip send failed).

Jul 24 15:54:55 core1 bash[15516]: [00001,10,15:54:54.346] CLUSTER HAS CHANGED (gossip send failed to [172.17.8.103:2113])

Jul 24 15:54:55 core1 bash[15516]: Old:

Jul 24 15:54:55 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.103:2113, 172.17.8.103:2113] | 2014-07-24 15:54:53.330

Jul 24 15:54:55 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.102:2113, 172.17.8.102:2113] | 2014-07-24 15:54:54.061

Jul 24 15:54:55 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.101:2113, 172.17.8.101:2113] | 2014-07-24 15:54:53.330

Jul 24 15:54:55 core1 bash[15516]: VND {bab429e2-3959-416e-bdc3-db25fa419310} [Unknown, 0.0.0.0:1112, n/a, 0.0.0.0:1113, n/a, 0.0.0.0:2112, 0.0.0.0:2113] -1/0/0/E-1@-1:{00000000-0000-0000-0000-000000000000} | 2014-07-24 15:54:54.285

Jul 24 15:54:55 core1 bash[15516]: New:

Jul 24 15:54:55 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.103:2113, 172.17.8.103:2113] | 2014-07-24 15:54:54.346

Jul 24 15:54:55 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.102:2113, 172.17.8.102:2113] | 2014-07-24 15:54:54.061

Jul 24 15:54:55 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.101:2113, 172.17.8.101:2113] | 2014-07-24 15:54:53.330

Jul 24 15:54:55 core1 bash[15516]: VND {bab429e2-3959-416e-bdc3-db25fa419310} [Unknown, 0.0.0.0:1112, n/a, 0.0.0.0:1113, n/a, 0.0.0.0:2112, 0.0.0.0:2113] -1/0/0/E-1@-1:{00000000-0000-0000-0000-000000000000} | 2014-07-24 15:54:54.285

Jul 24 15:54:55 core1 bash[15516]: --------------------------------------------------------------------------------

Jul 24 15:54:55 core1 bash[15516]: [00001,10,15:54:54.465] ELECTIONS: (V=0) TIMED OUT! (S=ElectingLeader, M=).

Jul 24 15:54:55 core1 bash[15516]: [00001,10,15:54:54.465] ELECTIONS: (V=1) SHIFT TO LEADER ELECTION.

Jul 24 15:54:55 core1 bash[15516]: [00001,10,15:54:54.465] ELECTIONS: (V=1) VIEWCHANGE FROM [0.0.0.0:2112, {bab429e2-3959-416e-bdc3-db25fa419310}].

Jul 24 15:54:55 core1 bash[15516]: [00001,10,15:54:54.562] Looks like node [172.17.8.101:2113] is DEAD (Gossip send failed).

Jul 24 15:54:55 core1 bash[15516]: [00001,10,15:54:54.562] CLUSTER HAS CHANGED (gossip send failed to [172.17.8.101:2113])

Jul 24 15:54:55 core1 bash[15516]: Old:

Jul 24 15:54:55 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.103:2113, 172.17.8.103:2113] | 2014-07-24 15:54:54.346

Jul 24 15:54:55 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.102:2113, 172.17.8.102:2113] | 2014-07-24 15:54:54.061

Jul 24 15:54:55 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.101:2113, 172.17.8.101:2113] | 2014-07-24 15:54:53.330

Jul 24 15:54:55 core1 bash[15516]: VND {bab429e2-3959-416e-bdc3-db25fa419310} [Unknown, 0.0.0.0:1112, n/a, 0.0.0.0:1113, n/a, 0.0.0.0:2112, 0.0.0.0:2113] -1/0/0/E-1@-1:{00000000-0000-0000-0000-000000000000} | 2014-07-24 15:54:54.486

Jul 24 15:54:55 core1 bash[15516]: New:

Jul 24 15:54:55 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.103:2113, 172.17.8.103:2113] | 2014-07-24 15:54:54.346

Jul 24 15:54:55 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.102:2113, 172.17.8.102:2113] | 2014-07-24 15:54:54.061

Jul 24 15:54:55 core1 bash[15516]: MAN {00000000-0000-0000-0000-000000000000} [Manager, 172.17.8.101:2113, 172.17.8.101:2113] | 2014-07-24 15:54:54.562

Jul 24 15:54:55 core1 bash[15516]: VND {bab429e2-3959-416e-bdc3-db25fa419310} [Unknown, 0.0.0.0:1112, n/a, 0.0.0.0:1113, n/a, 0.0.0.0:2112, 0.0.0.0:2113] -1/0/0/E-1@-1:{00000000-0000-0000-0000-000000000000} | 2014-07-24 15:54:54.486

Jul 24 15:54:55 core1 bash[15516]: --------------------------------------------------------------------------------

Jul 24 15:54:55 core1 bash[15516]: [00001,10,15:54:55.466] ELECTIONS: (V=1) TIMED OUT! (S=ElectingLeader, M=).

Jul 24 15:54:55 core1 bash[15516]: [00001,10,15:54:55.468] ELECTIONS: (V=2) SHIFT TO LEADER ELECTION.

Jul 24 15:54:55 core1 bash[15516]: [00001,10,15:54:55.468] ELECTIONS: (V=2) VIEWCHANGE FROM [0.0.0.0:2112, {bab429e2-3959-416e-bdc3-db25fa419310}].

Jul 24 15:54:56 core1 bash[15516]: [00001,10,15:54:56.469] ELECTIONS: (V=2) TIMED OUT! (S=ElectingLeader, M=).

Jul 24 15:54:56 core1 bash[15516]: [00001,10,15:54:56.469] ELECTIONS: (V=3) SHIFT TO LEADER ELECTION.

Jul 24 15:54:56 core1 bash[15516]: [00001,10,15:54:56.469] ELECTIONS: (V=3) VIEWCHANGE FROM [0.0.0.0:2112, {bab429e2-3959-416e-bdc3-db25fa419310}].

Jul 24 15:54:57 core1 bash[15516]: [00001,10,15:54:57.471] ELECTIONS: (V=3) TIMED OUT! (S=ElectingLeader, M=).

Jul 24 15:54:57 core1 bash[15516]: [00001,10,15:54:57.471] ELECTIONS: (V=4) SHIFT TO LEADER ELECTION.

Jul 24 15:54:57 core1 bash[15516]: [00001,10,15:54:57.471] ELECTIONS: (V=4) VIEWCHANGE FROM [0.0.0.0:2112, {bab429e2-3959-416e-bdc3-db25fa419310}].

Jul 24 15:54:58 core1 bash[15516]: [00001,10,15:54:58.471] ELECTIONS: (V=4) TIMED OUT! (S=ElectingLeader, M=).

Jul 24 15:54:58 core1 bash[15516]: [00001,10,15:54:58.471] ELECTIONS: (V=5) SHIFT TO LEADER ELECTION.

Jul 24 15:54:58 core1 bash[15516]: [00001,10,15:54:58.471] ELECTIONS: (V=5) VIEWCHANGE FROM [0.0.0.0:2112, {bab429e2-3959-416e-bdc3-db25fa419310}].

Jul 24 15:54:59 core1 bash[15516]: [00001,10,15:54:59.472] ELECTIONS: (V=5) TIMED OUT! (S=ElectingLeader, M=).

what is the output of ifconfig in your image? Does 172.17.8.103 exist? Could this possibly a permissions etc issue as well?

Jul 24 15:28:24 core3 bash[20310]: Exit reason: Http async server failed to start listening at [http://172.17.8.103:2113/].

Also note that in https://github.com/eventstore/eventstore/wiki/Setting-up-an-OSS-cluster gossip seeds point to internal ports.

In

mono-sgen EventStore.ClusterNode.exe --log=/var/log/eventstore --int-ip=0.0.0.0 --ext-ip=0.0.0.0 --int-tcp-port=1112 --ext-tcp-port=1113 --int-http-port=2112 --ext-http-port=2113 --cluster-size=3 --use-dns-discovery- --db=/data/eventstore --http-prefix=http://*:2113/ --gossip-seed=172.17.8.101:2113 --gossip-seed=172.17.8.102:2113 --gossip-seed=172.17.8.103:2113

You point to external it seems

Ok so changing it to the internal http port for gossip seeds didn’t help unfortunately, but I will make sure to make that change in my configuration, thanks!

This was the command:

mono-sgen EventStore.ClusterNode.exe --log=/var/log/eventstore --int-ip=0.0.0.0 --ext-ip=0.0.0.0 --int-tcp-port=1112 --ext-tcp-port=1113 --int-http-port=2112 --ext-http-port=2113 --cluster-size=3 --use-dns-discovery- --db=/data/eventstore --http-prefix=http://*:2113/ --gossip-seed=172.17.8.101:2112 --gossip-seed=172.17.8.102:2112 --gossip-seed=172.17.8.103:2112

However regarding the crash you are totally right, it seems like in the container the external interface is not available… there’s a whole different range of IPs there and the point is that it should be managed by Docker without interference it seems. So this kind of leaves me with the 0.0.0.0 binding which does work from a network perspective. Port 2112 is accessible from all nodes on their host’s external NIC so I’d think it would be fine.

However, running curl against a node’s internal http port gives a 400:

core@core2 ~ $ curl 172.17.8.101:2112

Bad Request (Invalid host)

Is it possible that the invalid host thing might be the issue preventing the ES instances from talking to each other?

Yes it absolutely will stop that on mono

Hi

I am getting the exact same behaviour on three separate VM’s using VmWare on a laptop with following setup:

  1. Ubuntu 14.04.1 LTS server

  2. Both ES 3.0.0 binary release and compiled dev branch using 3.2.8 mono.

I added some basic logging to the code and I discovered, that the send gossip fails with “Aborted.” error. When I ignored this error (did not publish a GossipSendFailed message) the ES started failing on TCP heartbeats, like this:

[02932,1942,19:28:17.153] ES TcpConnection closed [19:28:17.153: N172.16.1.129:1112, L, {93506e3d-ce29-4083-b9ca-50601941ae8b}]:

Received bytes: 0, Sent bytes: 0

Send calls: 0, callbacks: 0

Receive calls: 1, callbacks: 1

Close reason: [Success] Socket closed

[02932,1942,19:28:17.153] Connection ‘master-normal’ [172.16.1.129:1112, {93506e3d-ce29-4083-b9ca-50601941ae8b}] closed: Success.

[02932,10,19:28:17.153] Looks like node [172.16.1.129:1112] is DEAD (TCP connection lost).

The whole cluster was stuck in an election loop, like the OP described (albeit skipping the http gossip send failed message). I tried messing with the heartbeat timeouts and intervals but that didn’t help much.

Does anyone have any idea whether this is an configuration/infrastructure issue or a real mono regression (which would be a bummer since it would make running es cluster on anything else than windows very hard) or maybe just a bug in es code?

Kind Regards

Martin

We were getting very similar behaviour. We found that the default value for MONO_THREADS_PER_CPU was too low (5 or something). When we set MONO_THREADS_PER_CPU=1000 the problem disappeared. I have not looked into it further but I am assuming that gossip continually fails on all of the nodes because their threadpools are depleted.

cheers,

Matt

This sounds plausible and would explain why we’ve only seen it on certain types of hardware and/or virtual machines. I’ll investigate more tomorrow with this, I now have a repro of the issue.

What is the repro? Set to 2 threads?

I don't want to preempt James but we reproduce by deploying a cluster in which each node is an Ubuntu vm with either 1 or 2 vCpus. You will notice that gossip fails regularly on all nodes. You will also see that the clusternode process accumulates tcp connections in CLOSE_WAIT state until it runs out of files.

If you set the MONO_THREADS_PER_CPU env variable to 100 or more (even 20 might be enough) on a 1 vCpu vm the problems go away.

It would probably behave the same running all the nodes on a single 1 or 2 vCpu vm as well.

From our experience, deploy the 3.0.0 linux build to a 3 node cluster on
Ubuntu, each vm having either 1 or 2 vCPUs. I think you could deploy all nodes on a single 1 or 2 vCPU vm with the same results. You will see
gossip regularly failing on each node and each clusternode process will
accumulate tcp connections in CLOSE_WAIT state until it runs out of files.

Each of the of the following applied individually “fix” the issue, although the only real fix is the MONO_THREADS_PER_CPU change.

  • Increase the vCPU count to 4 on each VM.
  • Change the MONO_THREADS_PER_CPU env variable to 100+ on a 1 or 2 vCPU VM. I suspect that even 20 is high enough though.
  • Turn off projections and set the worker thread limit to 1 on a 1 or 2 vCPU VM
  • Set the gossip interval very high, like 10 seconds and the timeout to 5 seconds on a 1 or 2 vCPU VM

As I mentioned, the MONO_THREADS_PER_CPU setting is the real fix, all the rest are bandaids that support the running out of threads hypothesis. I have repro’d this on VMWare and appropriately sized AWS instances.

Thanks we will take a look at it should be an easy fix.

This appears to be the case. The easiest fix is to modify the ./run-node script to set this locally for the process. However, there might be a better way, I’ll investigate.

This was a great catch (we don’t often test on small machines - normally 4 core at which point 20 threads is fine!).

Cheers,

James

Thanks so much for finding this! I’ve been struggling with getting a cluster to gossip on Azure and this was the missing fix. (That, and I had to set the gossip port to the value of the internal HTTP port, apparently. I don’t know why…)

I’m going to update the Azure setup docs further to include a section on clustering using this once I’m sure it’s stable in production.

I am using 3.0.5 on ubuntu 14.10 and I still get the same problem. Have you fixed this problem? This is the command line to start the cluster. Is there anything I have missed?

./clusternode --db=/event-store/data --log=/event-store/log --ext-ip=0.0.0.0 --ext-tcp-port=1113 --run-projections=All --http-prefixes=

http://*:2113/ --cluster-size=3 --discover-via-dns=false --gossip-timeout-ms 2000 --int-tcp-heartbeat-timeout 2000 --gossip-seed=10.0.2.223:2112,10.0.2.222:2112 --int-ip=10.0.2.224

Look at your ports for gossip (2112 vs 2113!)