Azure and Clustering

I keep hearing how bad Azure is for clustering ES. If the instances were all on an Azure VNET, and the only endpoints exposed to the outside world were API endpoints that are themselves members of the VNET, then that should work, or am I missing something?

Before I go setting up a test cluster for development purposes I was hoping to save some headaches by asking the group.


Jason Wyglendowski

It’s not bad for clustering ES specifically; it’s bad for clustering any quorum system, including e.g. ZooKeeper, Consul, etc. The reason is that it only has two fault domains (availability zones in Amazon-speak). Consequently, if the wrong side goes down you will find yourself with only a minority of nodes available.
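
To make the quorum argument concrete, here is a minimal sketch in Python. It is not Event Store code; the function names and the node layouts are hypothetical, and it only assumes that a cluster needs a strict majority of its nodes to keep operating.

```python
def majority(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def survives_any_single_domain_loss(nodes_per_domain: list[int]) -> bool:
    """True if a majority remains no matter which one fault domain goes down.

    nodes_per_domain is a hypothetical layout, e.g. [2, 1] means two nodes
    in the first fault domain and one node in the second.
    """
    total = sum(nodes_per_domain)
    needed = majority(total)
    return all(total - lost >= needed for lost in nodes_per_domain)


# Three nodes over two fault domains: losing the domain with two nodes
# leaves one node, which is a minority.
print(survives_any_single_domain_loss([2, 1]))
# Three nodes over three fault domains: any single loss leaves two of three.
print(survives_any_single_domain_loss([1, 1, 1]))
```

With only two fault domains, the larger domain always holds at least half of the nodes, so there is always one failure that costs the majority regardless of cluster size.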

Within a virtual network things are just IP-addressable, so there are no issues beyond this. External access is a mess because of the DNS and HTTP prefix requirements.

Can you describe what you’re hoping to achieve?


I have a stream of events coming in from a game, somewhere in the range of one per second per user, in 20 to 30 minute intervals per game session. Right now those events go into MongoDB through a web service, then they are read off asynchronously and the results are ultimately stored in SQL. What I am looking to demonstrate is that I can replace everything related to the game events after the web service with ES. I want to prove that ES projections can handle the load and offer a superior scaling solution to the current method.

I believe that with ES it will also be easier to switch out other pieces of our stack, like SQL Server, if we properly apply projections for handling the stats generated from the events.
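
As a rough back-of-the-envelope check on the load described above, the arithmetic can be sketched like this. The one-event-per-second rate and the 20 to 30 minute session length come from the description; the concurrent player count is a made-up number for illustration only.

```python
def events_per_session(session_minutes: int, events_per_second: int = 1) -> int:
    """Events one user produces over a whole game session."""
    return session_minutes * 60 * events_per_second


def sustained_write_rate(concurrent_users: int, events_per_second: int = 1) -> int:
    """Events per second the store has to absorb while those users are playing."""
    return concurrent_users * events_per_second


# From the description: roughly one event per second for 20 to 30 minutes.
print(events_per_session(20), events_per_session(30))

# Hypothetical: 5000 concurrent players (not a figure from this thread).
print(sustained_write_rate(5000))
```

The sustained write rate scales directly with the number of concurrent players, so that is the number a test cluster would need to be sized against.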


Jason Wyglendowski

Hey, I was researching a different issue and just happened upon these links that I thought might help you out. It’s all a bit over my head (being new to Azure), but it looks like there might be some headway made in forcing a 3 FD cluster. In this link a guy (from Microsoft?) mentions that they released this feature within the last few weeks:

However, the second link in his comment, which is supposed to point to an example of a 3 FD setup, doesn’t work anymore. I was able to find the document using Google’s cached page:

That page just shows an example of using an Azure Resource Manager template from the following GitHub repo:

I’ve never used ARM or its templates, and a commenter on the second page above was having issues getting it working. But I thought I would send it your way in case you could make heads or tails of the additional information.

Indeed it does now look like that might be possible. I’ll investigate and post back in this thread.

Thanks for the information, Tim. It’s appreciated.


Jason Wyglendowski

The cached link is not working anymore.
This might shed some light on the procedure, with a bit of tweaking:

The repo here has a template: though I’ve not tested it. Now if only we could sort out the disk latency, Azure might be an actual legit option!

Okay, so just wondering, now that you mention it, James: what is the best option for clustering right now?

On what platform?

Until disk latency is sorted out, any kind of solution that wouldn’t involve Azure I guess.
On what kind of setup do you have the most customers/best experiences?

Is latency still an issue with

It’s possible to get reasonable performance out of Azure, but the tail latencies always seem to be a mess. Since they’ve finally made three fault domains available you can try running on instance storage and things are mostly OK.

But as usual with Azure, the optimal route to success is to just use AWS.

James, what would you regard as reasonable performance? (as in number of stored events per second, or any measurement)

With instance storage, at what VM and DB sizes have you found it to be mostly OK?

The overall performance is not too bad. It’s the tail latency you have to watch for (e.g. 99.9%-99.99%).
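
To illustrate why the average can look fine while the tail does not, here is a small sketch in Python. The latency samples are synthetic (made up for illustration, not measured on Azure), and the nearest-rank percentile helper is just one common definition.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Synthetic write latencies in milliseconds: most writes are fast,
# a small fraction stall badly.
latencies_ms = [1.0] * 9980 + [500.0] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean  = {mean_ms:.2f} ms")
print(f"p99   = {percentile(latencies_ms, 99):.1f} ms")
print(f"p99.9 = {percentile(latencies_ms, 99.9):.1f} ms")
```

In this made-up data set the mean and the 99th percentile both look healthy, and the stalls only show up at the 99.9th percentile, which is why measuring that far out matters.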

Okay, I get it.

I’ve seen databases running on D4 machines of ~300GB without too many problems.

I have been reading the free ebook Microsoft Azure Essentials - Fundamentals of Azure. It references a white paper that was originally written for SQL Server, but the author of the ebook says that the disk configuration discussed in that paper should be followed for any system needing excellent performance. Are there any pitfalls in following its advice?


Jason Wyglendowski

Not sure on the specifics, but apparently the best throughput can be obtained with DS-type instances and a ton of disks striped together. Personally I’d just use the instance store now that there are three availability zones (well, not really AZs, but fault domains - not quite as good, but I guess we have to take what we can when it comes to Azure).

Can always dump one node outside as an async replica "just in case"

This will work well until data sizes get large