Be aware that configuring this in Azure is a shit show unless the boxes are either in the same cloud service or the same virtual network, thanks to its history as a PaaS with IaaS bolted on, still making use of “cloud services”.
Currently, I’m making use of three cloud services because I couldn’t wrap my head around how to configure the cluster when the VMs were in the same service and VNet.
In that case I had but one DNS entry, and the .NET client was trying to call the VNet IPs after discovering them through gossip.
I’d love to get this working. Using three “cloud services” within the same VNet seems like it would work, though.
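For reference, a minimal sketch of what a per-node start-up could look like with DNS discovery turned off and explicit gossip seeds instead (flag names per Event Store 3.x; the 10.0.0.x addresses and cluster size are placeholders, not a tested Azure setup):

    # Run on the node whose VNet IP is 10.0.0.4; repeat on each node,
    # swapping --int-ip/--ext-ip and listing the *other* nodes as seeds.
    # Gossip happens over the internal HTTP port (2112 here).
    ./clusternode --int-ip=10.0.0.4 --ext-ip=10.0.0.4 \
        --int-tcp-port=1112 --ext-tcp-port=1113 \
        --int-http-port=2112 --ext-http-port=2113 \
        --cluster-size=3 --discover-via-dns=false \
        --gossip-seed=10.0.0.5:2112,10.0.0.6:2112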
Man… I’m about to call single-node “good enough”. Should I continue?
I’m experiencing similar issues while attempting to run a cluster on Amazon Web Services.
My setup is quite different, though: I run Event Store inside a Docker container using this Dockerfile.
On the AWS side, I’ve defined a security group that allows TCP connections on ports 1112, 1113, 2112 and 2113 for all machines within the group (1113 and 2113 are also accessible from the outside).
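For reference, those rules expressed with the AWS CLI might look roughly like this (the security group id is a placeholder):

    # Internal-only ports, reachable only from other members of the same group
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 1112 --source-group sg-0123456789abcdef0
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 2112 --source-group sg-0123456789abcdef0
    # External client ports, open to the outside
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 1113 --cidr 0.0.0.0/0
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 2113 --cidr 0.0.0.0/0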
My first guess was to set int-ip to the private IP of the machine and ext-ip to its public IP, but then the nodes refused to start with this log. Setting both int-ip and ext-ip to the machine’s public IP raises the same error.
When I leave int-ip and ext-ip unspecified, thus falling back to 127.0.0.1, the nodes start up but fail to build the cluster, with the same logs that Chris observes.
I will try to skip the Docker part and run the node directly on the machine’s OS to rule out a Docker-related problem.
Thanks a lot for your prompt reply. Setting both int-ip and ext-ip to the private IP of the EC2 instances seems to have solved the problem (and of course, that was the only combination I didn’t try…). It makes sense in hindsight: on EC2 the public IP is provided by NAT and never appears on the instance’s own network interface, so the node can’t bind to it.
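In case it helps anyone else, a sketch of what each node’s start-up could look like with that fix, using the EC2 instance metadata service to pick up the private address (the seed IPs and flag set are placeholders, same assumptions as the earlier sketch):

    # Resolve this instance's private IP and bind both interfaces to it
    PRIVATE_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
    ./clusternode --int-ip=$PRIVATE_IP --ext-ip=$PRIVATE_IP \
        --cluster-size=3 --discover-via-dns=false \
        --gossip-seed=172.31.0.11:2112,172.31.0.12:2112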
So now I’ve got my three nodes talking to each other, which is quite cool. However, scanning the logs, I see that the cluster seems quite unstable: at least one node’s dead/alive status changes at almost every gossip round.
Is that normal? Is it maybe something I can tune by tweaking the various timeout settings (although the defaults seem generous enough)? For my testing purposes I use t2.micro instances, which have pretty poor network performance; that could also explain it.
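If tuning does turn out to be necessary, the gossip-related knobs would be the ones to look at; something like the following (flag names per Event Store 3.x, values purely illustrative rather than recommendations) loosens the timings for a slow network:

    # Gossip less often and tolerate slower responses before declaring a node dead
    ./clusternode ... \
        --gossip-interval-ms=2000 \
        --gossip-timeout-ms=2500 \
        --gossip-allowed-difference-ms=120000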
Anyway, thanks again for the tip.
Valentin
On Thu Feb 19 2015 at 8:19:10 PM, James Nugent [email protected] wrote:
It sounds like there is one node that can reach one of the others but not the other, so gossip will toggle dead/alive/dead/alive (not exactly like that, but random toggles).
Can you verify that there is bi-directional connectivity among all the machines in the cluster? Failing that, if you can post the logs and check the threads-per-CPU setting mentioned in an earlier post (if you’re on a small box), we may be able to diagnose more easily.
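One quick way to check that from the shell, assuming nc is available, is to probe every peer from every node (substitute each peer’s private IP in turn):

    # Exit status 0 for each port means this node can reach that peer
    for port in 1112 1113 2112 2113; do
        nc -zv -w 3 10.0.0.5 $port
    done

Remember to run it in both directions; firewall rules that only allow one direction produce exactly this kind of flapping.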
For the moment, I’m assigned to another development task, but I have good hope I’ll be able to switch back to that ES clustering thing before the end of the day (Paris time).
I’ll keep you updated on my progress as soon as possible.
On Tue Feb 24 2015 at 3:59:29 PM, James Nugent [email protected] wrote:
Setting MONO_THREADS_PER_CPU=100 in the nodes’ environment seems to have fixed my issues.
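For anyone else hitting this, the variable just needs to be set in the environment of whatever launches the node, for example (launcher name and flags as in the sketches above, so treat them as placeholders):

    # Mono reads MONO_THREADS_PER_CPU at start-up, so export it before launching
    export MONO_THREADS_PER_CPU=100
    ./clusternode --int-ip=$PRIVATE_IP --ext-ip=$PRIVATE_IP ...

In a Dockerfile the equivalent would be an ENV MONO_THREADS_PER_CPU=100 line before the entrypoint.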
I now have an AWS OpsWorks layer with a nice cluster of Event Store nodes running on t2.micro instances (and this is quite cool!).
Each change in the cluster topology (adding or removing nodes) is handled correctly by the cluster, including master re-election when the current master goes down.
By the way, I gave up on using Docker to deploy Event Store and wrote my own Chef cookbook for that purpose. It may not be suited for production yet, but I’d be glad to share it if anyone’s interested.