Cannot Get Cluster Up

Hey guys,

This is driving me nuts! I hope you all can help me out here.

I have three Linux machines in Azure and I cannot for the life of me get them working together.

The cluster is configured like this:

I’m getting nothing but heartbeat timeouts.

I saw the other thread about this and added MONO_THREADS_PER_CPU=1000

Any help is appreciated.

Logs are here

You are running the internal IP as localhost (all elections etc. happen on this interface).

INT IP: 127.0.0.1 ()

Set it to your IP address.
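For example (a sketch only; 10.0.0.4 stands in for your VM’s private address, and the flag assumes an Event Store 3.x command line):

    # bind the internal interface to the box's private IP instead of 127.0.0.1
    ./clusternode --int-ip=10.0.0.4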

Cheers,

Greg

I actually had that set to the IP before. I took it out to test. I’ll revisit this in the morning.

Be aware that configuring this in Azure is a shit show unless the boxes are either in the same cloud service or virtual network, thanks to its history as a PaaS with IaaS bolted on, still making use of “cloud services”.

Currently, I’m making use of three cloud services because I couldn’t wrap my head around how to configure the cluster when the VMs were in the same service and VNet.

In that case I had but one DNS entry, and the .NET client was trying to call the VNet IPs after discovering them through gossip.

I’d love to get this working. Using three “cloud services” within the same VNet seems like it would work, though.

Man…I’m about to call singlenode “Good enough”. Should I continue? :wink:

Thanks,

Chris

OK. I have three fresh machines up and running, but still no cluster love.

Can you guys take a look at my logs, please?

https://gist.github.com/4cc634aaab914f85b2d0.git

Sorry. Gotta whack that .git extension.

https://gist.github.com/trbngr/4cc634aaab914f85b2d0

This sure feels like the same problem with MONO_THREADS_PER_CPU.
How do I ensure that it’s set for the clusternode process?
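For now I’m exporting it in the wrapper that launches the node, so the process inherits it (a sketch; the config path is made up):

    # set the Mono thread-pool override for this process and its children
    export MONO_THREADS_PER_CPU=100
    ./clusternode --config=/etc/eventstore/config.yaml

To verify it actually took effect on the running process, something like this should work:

    cat /proc/$(pgrep -f clusternode)/environ | tr '\0' '\n' | grep MONO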

I think it’s actually the Azure load balancer that’s killing it.

All my endpoints are load-balanced. I’ll try again with non-load balanced endpoints when I get some more time.

Hi guys

I’m experiencing similar issues while attempting to run a cluster on Amazon Web Services.

My setup is quite different though: I run Event Store inside a Docker container using this Dockerfile.

On the AWS side, I’ve defined a security group that allows TCP connections on ports 1112, 1113, 2112 and 2113 for all machines within the group (1113 and 2113 are also accessible from the outside).
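For reference, the container publishes all four ports (a sketch; es-node and the image name are just my local labels):

    # 1112/2112: internal TCP and HTTP (gossip/replication); 1113/2113: external TCP and HTTP (clients)
    docker run -d --name es-node \
      -p 1112:1112 -p 1113:1113 -p 2112:2112 -p 2113:2113 \
      my-eventstore-image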

My first guess was to set int-ip to the private IP of the machine and ext-ip to its public IP, but then the nodes refused to start with this log. Setting both int-ip and ext-ip to the machine’s public IP raises the same error.

When I leave int-ip and ext-ip unspecified, thus falling back to 127.0.0.1, the nodes start up but fail to build the cluster, with the same logs Chris observes.

I will try to skip the Docker part and run the node directly on the machine’s OS to rule out a Docker-related problem.

Cheers,

Valentin

I confirm I’ve got the same “Exit reason: The requested address is not valid in this context” error when running directly on the machine’s OS.

You need to bind to the LOCAL address of the box, not the public one. You may also have to add an HTTP prefix for this.
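Something like this (a sketch; 10.0.0.5 stands in for your instance’s private address, and the HTTP-prefix flag name may differ in your version):

    # bind both interfaces to the private address and accept HTTP requests on any host name
    ./clusternode --int-ip=10.0.0.5 --ext-ip=10.0.0.5 --ext-http-prefixes="http://*:2113/"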

James,

Thanks a lot for your prompt reply. Setting both int-ip and ext-ip to the private IP of the EC2 instances seems to have solved the problem (and of course, that was the only combination I didn’t try …).

So now I’ve got my three nodes talking to each other, which is quite cool. However, scanning the logs, I see that the cluster seems to be quite unstable: at least one node’s dead/live status changes at almost every gossip round.

Is that normal? Maybe it’s something I can tune by tweaking the various timeout configurations (although the defaults seem flexible enough)? For my testing purposes I use t2.micro instances, which have pretty poor network performance; this could also explain it.
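The knobs I had in mind look like this (flag names are my reading of the 3.x options, so treat this as a sketch, not something I’ve verified):

    # slow gossip down and lengthen heartbeats to tolerate a flaky t2.micro network
    ./clusternode --gossip-interval-ms=2000 --gossip-timeout-ms=2500 \
      --int-tcp-heartbeat-timeout=1500 --ext-tcp-heartbeat-timeout=1500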

Anyway, thanks again for the tip.

Valentin

On Thu Feb 19 2015 at 8:19:10 PM, James Nugent [email protected] wrote:

Hey Valentin,

If you are on a micro instance it might be worth setting MONO_THREADS_PER_CPU.

See this thread from last year:

https://groups.google.com/d/msg/event-store/nl0FasMfmv0/K1vjV5-v6zwJ

cheers

Andrew

It sounds like there is one node that one of the others can talk to and one can’t, so gossip will toggle dead/alive/dead/alive (not exactly like this, but random toggles).

Can you verify that there is bidirectional connectivity among all the machines in the cluster? Failing that, if you can post the logs and check the threads-per-CPU setting mentioned in an earlier post (if you’re on a small box), we may be able to diagnose more easily.
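A quick way to check it (a sketch with netcat; node-b and node-c stand in for your other hosts) is to probe the internal TCP and HTTP ports from every box to every other box:

    # run this on each node against the other two; every probe must succeed in both directions
    for host in node-b node-c; do
      for port in 1112 2112; do
        nc -zv "$host" "$port"
      done
    done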

Thanks,

James

Hi guys

For the moment I’m assigned to another development task, but I have good hope I’ll be able to switch back to this ES clustering work before the end of the day (Paris time).
I’ll keep you updated on my progress as soon as possible.

On Tue Feb 24 2015 at 3:59:29 PM, James Nugent [email protected] wrote:

Hi everyone,

Setting MONO_THREADS_PER_CPU=100 in the nodes’ environment seems to have fixed my issues.

I now have an AWS OpsWorks layer with a nice cluster of Event Store nodes running on t2.micro instances (and this is quite cool!).

Each change in the cluster topology (adding and removing nodes) is correctly handled by the cluster (including master re-election when the current master goes down).

By the way, I gave up on using Docker to deploy Event Store and wrote my own Chef cookbook for that purpose. It may not be suited for production yet, but I would be glad to share it if anyone’s interested.

Thanks a lot for your help.

Valentin

I’d be interested in seeing what you have running - we’ve been discussing supporting AWS with “official” Event Store AMIs in the future.