Scripted Deploy of Event Store on Azure VMs

I thought I would share the work I have done to automatically set up Event Store running on Azure VMs. Right now it only targets Windows hosts, but it should be easy to extend to Linux: you would need a shell script to replace the PowerShell provisioning script uploaded in step 4. In my experience it takes about 20 minutes until all the VMs are running Event Store.

https://gist.github.com/pbolduc/f8ba49358a97e1e95332

Files:

  • ProvisionEventStore.ps1 - creates the affinity group, storage account, VMs, and data disks for the VMs

  • EventStoreScriptExtensionProvisionFile.ps1 - run automatically on each VM to install and configure Event Store to run as a service using NSSM

Features:

  • Creates any number of VMs for your cluster; there is no validation to ensure the number is odd (see -ClusterSize)

  • Creates a striped data volume using as many data disks as the VM will support based on the instance size. The user specifies the total target disk size in GB (see the sizing sketch after this list)

  • Creates all VMs inside the same cloud service. VMs in the same cloud service can resolve the VM names to internal IP addresses

  • Uses a virtual network so nodes can use internal addresses for communication

  • Sets up all resources in one affinity group to ensure VMs and storage are close to each other in the data center

  • Creates a random storage account name to avoid conflicts (uses the user-supplied prefix)
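
For anyone curious about the disk-sizing logic mentioned above, here is a minimal sketch of the idea, not the exact gist code. It assumes the classic Azure module; $vm stands for a VM configuration built with New-AzureVMConfig, and $InstanceSize / $DataDiskSize are the script's own parameters:

# Sketch only: derive the data-disk count from the instance size and
# attach that many equally sized disks, to be striped later on the VM.
$roleSize = Get-AzureRoleSize -InstanceSize $InstanceSize
$diskCount = $roleSize.MaxDataDiskCount
$diskSizeGB = [math]::Ceiling($DataDiskSize / $diskCount)
for ($i = 0; $i -lt $diskCount; $i++) {
    $vm = $vm | Add-AzureDataDisk -CreateNew -DiskSizeInGB $diskSizeGB -DiskLabel "data$i" -LUN $i
}
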
On each VM created:

  • Formats all the data disks into a single striped volume

  • Installs Chocolatey

  • Installs NSSM using Chocolatey

  • Downloads Event Store 3.0.1 from http://download.geteventstore.com/binaries/EventStore-OSS-Win-v3.0.1.zip

  • Determines the IP addresses of the other nodes and configures the gossip seeds in the configuration file

  • Adds a service called ‘EventStore’ that will start automatically

  • Logs are written to D:\Logs\eventstore\

  • Data is stored in F:\Data\eventstore\

  • Adds firewall rules to allow Event Store traffic

  • Adds netsh URL ACL reservations for Event Store (a sketch of these last steps follows this list)
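
As a rough illustration of the last few bullets, the per-VM setup amounts to something like the sketch below. The install path F:\EventStore and the ports 2113/1113 (Event Store's defaults) are illustrative assumptions, not necessarily the gist's exact values:

# Register Event Store as an auto-starting service via NSSM.
nssm install EventStore "F:\EventStore\EventStore.ClusterNode.exe" "--config=F:\EventStore\config.yaml"
nssm set EventStore Start SERVICE_AUTO_START

# Open the external HTTP and TCP ports in the Windows firewall.
netsh advfirewall firewall add rule name="EventStore HTTP" dir=in action=allow protocol=TCP localport=2113
netsh advfirewall firewall add rule name="EventStore TCP" dir=in action=allow protocol=TCP localport=1113

# Reserve the HTTP prefix so the service can listen without running as admin.
netsh http add urlacl url=http://+:2113/ user=Everyone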

How to use:

  1. Manually create a virtual network so that your Event Store nodes can talk to each other on private IP addresses

  2. Manually create a named subnet in your virtual network

  3. Manually create a storage account and container to host the custom script extension

  4. Upload the file EventStoreScriptExtensionProvisionFile.ps1 (found in the gist) to your custom script extension container (see the sketch after this list)

  5. Install the Azure PowerShell Cmdlets and ensure they are working with your subscription (see: How to install and configure Azure PowerShell)

  6. Log in to your Azure account using Add-AzureAccount

  7. Run ProvisionEventStore.ps1 with your desired parameters
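
For steps 3 and 4, something along these lines works with the classic storage cmdlets (a sketch; the account name, key, and the container name 'scripts' are placeholders):

$ctx = New-AzureStorageContext -StorageAccountName "storage-account-name" -StorageAccountKey "storage-account-key"
New-AzureStorageContainer -Name "scripts" -Context $ctx
Set-AzureStorageBlobContent -File ".\EventStoreScriptExtensionProvisionFile.ps1" -Container "scripts" -Context $ctx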

Example Execution:

Run Add-AzureAccount to get an authorization token:

$VerbosePreference = 'Continue'

Write-Verbose "$(Get-Date -Format 'T') Starting Provision Environment"

. "$PSScriptRoot\ProvisionEventStore.ps1" `
    -ClusterSize 3 `
    -DataDiskSize 160 `
    -Location "West US" `
    -InstanceSize "Medium" `
    -username "admin-username" `
    -password "admin-password" `
    -ServiceName "cloud-service-name" `
    -VMName "vm-name-prefix" `
    -ImageName "a699494373c04fc0bc8f2bb1389d6106__Windows-Server-2012-R2-201502.01-en.us-127GB.vhd" `
    -AffinityGroup "affinity-group-name" `
    -TargetStorageAccountName "target-storage-account" `
    -AvailabilitySetName "availability-set-name" `
    -VNetName "virtual-network-name" `
    -VNetSubnetName "subnet-name" `
    -CustomScriptExtensionStorageAccountName "storage-account-name" `
    -CustomScriptExtensionStorageAccountKey 'storage-account-key' `
    -CustomScriptExtensionContainerName 'storage-account-container-name' `
    -CustomScriptExtensionProvisionFile 'EventStoreScriptExtensionProvisionFile.ps1'

Write-Verbose "$(Get-Date -Format 'T') Provision Complete"

Example output:

VERBOSE: 1:54:35 PM Starting Provision Environment

VERBOSE: 1:54:35 PM Ensuring Affinity Group ‘EventStore’ exists and is in ‘West US’ location.

VERBOSE: 1:54:35 PM - Begin Operation: Get-AzureAffinityGroup

VERBOSE: 1:54:36 PM - Completed Operation: Get-AzureAffinityGroup

VERBOSE: 1:54:36 PM - Begin Operation: Get-AzureStorageAccount

VERBOSE: 1:54:37 PM - Completed Operation: Get-AzureStorageAccount

WARNING: GeoReplicationEnabled property will be deprecated in a future release of Azure PowerShell. The value will be merged into the AccountType property.

VERBOSE: 1:54:37 PM - Begin Operation: New-AzureStorageAccount

VERBOSE: 1:55:09 PM - Completed Operation: New-AzureStorageAccount

VERBOSE: 1:55:09 PM Waiting for storage account eventstoreaeugnjsexgyefy to be available…

VERBOSE: 1:55:09 PM - Begin Operation: Get-AzureStorageAccount

VERBOSE: 1:55:10 PM - Completed Operation: Get-AzureStorageAccount

WARNING: GeoReplicationEnabled property will be deprecated in a future release of Azure PowerShell. The value will be merged into the AccountType property.

VERBOSE: 1:55:12 PM Creating Virtual Machines

VERBOSE: 1:55:12 PM - Begin Operation: New-AzureService

VERBOSE: 1:55:14 PM - Completed Operation: New-AzureService

VERBOSE: 1:55:14 PM - Begin Operation: Get-AzureRoleSize

VERBOSE: 1:55:14 PM - Completed Operation: Get-AzureRoleSize

VERBOSE: 1:55:28 PM - Begin Operation: New-AzureVM - Create Deployment with VM ES-demo-1

VERBOSE: 1:56:40 PM - Completed Operation: New-AzureVM - Create Deployment with VM ES-demo-1

VERBOSE: 1:56:40 PM - Begin Operation: New-AzureVM - Create VM ES-demo-2

VERBOSE: 1:57:47 PM - Completed Operation: New-AzureVM - Create VM ES-demo-2

VERBOSE: 1:57:47 PM - Begin Operation: New-AzureVM - Create VM ES-demo-3

VERBOSE: 1:58:53 PM - Completed Operation: New-AzureVM - Create VM ES-demo-3

VERBOSE: 1:58:53 PM Provision Complete

Awesome!

I’ve done the same sort of thing for Linux boxes. I’ve been meaning to clean up the scripts and share them.

I also, as part of my deployment, have cron jobs to run scavenging and backup to blob storage.

Hopefully I can find the time to clean them up and share soon.

Cheers,

Chris

I will have to change the installation location. I was testing using a client VM and the box rebooted on me. As Microsoft says, all the files on D:\ were lost. Currently the scripts above install everything into D:\ and store log files on D:\. The Event Store data is on a separate data disk and would be safe across reboots of the node.

Not trying to nitpick, and you are absolutely right that D: is not the right drive for this, but I just wanted to add that D: survives a simple reboot; it's only a relocation of the VM to another host that kills D:.
I guess, since you said that the box rebooted itself, it must have been a maintenance relocation.

I have updated the gist. For now, the only thing I am storing on D:\ is the logs. In a proper deployment, the logs would be shipped in real time to something like ELK.

Awesome Phil

Very cool!!

We just tried this and it worked very well! Thank you for this.

One issue we had was logging into ES, though. Is it not admin / changeit? That didn't work for us.

That problem has disappeared. We are able to log in now.

I think we’ve gotten everything working, after some tweaks to the results of running this script.

After running it, while all 3 nodes were created perfectly and ES was running just fine on each, if we killed the service on the first node, we were no longer able to browse to http://ourcloudservice:2113, but we COULD browse to 2213 and 2313.

It might be that there is a different way to achieve this, but here's what we ended up doing to take advantage of the built-in load balancer for the cloud service instances.

First, we changed the external HTTP port to 2113 on each of the nodes, and added that port as a load-balanced set endpoint (https://msdn.microsoft.com/en-us/library/azure/dn655055.aspx).

I set HttpPrefixes to 10.0.0.4:2113, .5, and .6 respectively on the three nodes of our cluster, and also set one to the cloud service's DNS address.

Finally, I added port-forwarding rules for each VM to map the external ports 2013, 2213, and 2313 to 2113 so that we can independently address each node if we want to. (For node 1 I had to manually add a firewall rule to allow 2013, but the other two nodes already had that from the script.)
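
For reference, here is a hedged sketch of what that endpoint setup might look like with the classic cmdlets; the endpoint names, load-balanced set name, and VM names are made up for illustration:

# Load-balanced endpoint on 2113, added to each node (shown for node 1).
Get-AzureVM -ServiceName "cloud-service-name" -Name "ES-demo-1" |
    Add-AzureEndpoint -Name "EsHttp" -Protocol tcp -PublicPort 2113 -LocalPort 2113 `
        -LBSetName "es-http" -ProbePort 2113 -ProbeProtocol tcp |
    Update-AzureVM

# Per-node direct endpoint so node 1 stays independently reachable on 2013.
Get-AzureVM -ServiceName "cloud-service-name" -Name "ES-demo-1" |
    Add-AzureEndpoint -Name "EsHttpDirect" -Protocol tcp -PublicPort 2013 -LocalPort 2113 |
    Update-AzureVM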

I tested by bringing up those direct addresses for each node, then running nssm stop eventstore on each of them to verify that they went down, while the main cloudservice:2113 remained up as long as one node still had Event Store running.

We haven’t pointed our nodejs app at eventstore yet to test, but so far this is working out.

I’d be happy to do a pull request (or make a new gist) if this seems like a helpful modification.

I was having problems with that as well. Another area where I had problems was configuring the external IP so that an external client could connect. It cannot be the load balancer IP, as Event Store wants to bind to (listen on) it. The external address is also published by the /gossip endpoint.

Should we create a GitHub project for these scripts? Does that make it easier to create pull requests?

Normally you bind to your local IP and port xxxx, then you add an HTTP prefix for whatever the public one is (e.g. from the load balancer).
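
Applied to this thread's setup, Greg's advice might translate to a config.yaml fragment like the one below on node 1. The option names assume Event Store 3.x, and the cloud service DNS name is a placeholder:

ExtIp: 10.0.0.4        # bind to the node's own VNet address
ExtHttpPort: 2113
HttpPrefixes: ['http://10.0.0.4:2113/', 'http://cloud-service-name.cloudapp.net:2113/']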

I have created a GitHub repo for these scripts. For now, I have copied the source directly from the gist. Please send me your changes as a pull request. I hope to integrate Greg's comment about the HTTP prefixes.

https://github.com/pbolduc/EventStore-DevOps

OK, great. Thanks Phil, will do this Monday.
Josh

Hi,

I’ve used these scripts successfully to deploy test ES3 clusters (with a few tweaks though, like for when you have multiple subscriptions on your account; I’ll send some PRs in the next few days).

However, what I still don’t get is how clients external to the VNet can connect to the cluster using gossip seeds.
Even if 1) all nodes gossip on the external HTTP endpoints and 2) the load-balanced set routes requests to the nodes, the nodes will answer with their internal IPs (10.0.0.4, …), because that is what we have in the config.yaml file:
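
For illustration, the relevant fragment looks something like this on node 1 (the values here are representative, not the exact file):

ExtIp: 10.0.0.4
GossipSeed: ['10.0.0.5:2113', '10.0.0.6:2113']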

A .NET client connecting will log the following : [13,15:19:19.784,INFO] Discovering: found best choice [10.0.0.6:1313,n/a] (Master).

I thought the solution to this was to configure the “ExtIp” parameter with the virtual IP address of the cloud service (hoping that gossip info on the external IP would use these IPs).
This doesn’t work though, since Event Store will attempt to bind to this IP (which is obviously not assigned to any network interface).

I’m a bit stuck with this setup, and I am thinking of taking all the nodes out of the VNet, and reserving IPs for each one (I use only a 3-node setup).

Does anyone have any thoughts on this?

Thx,
Gabriel.

There is a patch and a card for this so you can assign host names instead of addresses and thus make the node show up as whatever you like. The problem is the node doesn’t actually know what the client sees it as, and multiple clients could see it differently.

I have not been able to update the scripts themselves yet; I’ve been ill for the last week or so. But I did document all the steps we did manually so far, here:

https://github.com/openAgile/EventStore-DevOps/blob/master/azure-powershell-windows/LoadBalancedSetSteps.md

I hope to update the script and try again soon, but trying to catch up on other stuff right now.

OK, so I thought I had found the problem and erased my last post.

Something prevents the virtual disk from being created. Has anyone experienced the same? What could be the problem?
The storage pool is created and is all fine and dandy. But no virtual disk is created, and hence no volume F:, so ES is never installed.

I’ve put some simple logging in EventStoreScriptExtensionProvision.ps1, like this:

# $errorFile is defined earlier in the provisioning script.
$Interleave = 65536 # is this the best value for Event Store?
$Interleave | Out-File -FilePath $errorFile

$uninitializedDisks = Get-PhysicalDisk -CanPool $true
$uninitializedDisks | Out-File -FilePath $errorFile -Append

$poolDisks = $uninitializedDisks

$numberOfDisksPerPool = $poolDisks.Length
$numberOfDisksPerPool | Out-File -FilePath $errorFile -Append

$poolName = "Data Storage Pool"

$newPool = New-StoragePool -FriendlyName $poolName -StorageSubSystemFriendlyName "Storage Spaces*" -PhysicalDisks $poolDisks
$newPool | Out-File -FilePath $errorFile -Append

$virtualDiskJob = New-VirtualDisk -StoragePoolFriendlyName $poolName -FriendlyName $poolName -ResiliencySettingName Simple -ProvisioningType Fixed -Interleave $Interleave -NumberOfDataCopies 1 -NumberOfColumns $numberOfDisksPerPool -UseMaximumSize -AsJob

# Wait for and clean up the job, using the same variable New-VirtualDisk returned.
Receive-Job -Job $virtualDiskJob -Wait
Remove-Job -Job $virtualDiskJob

# Initialize and format the virtual disks on the pools.
$formatted = Get-VirtualDisk | Initialize-Disk -PassThru | New-Partition -AssignDriveLetter -UseMaximumSize | Format-Volume -FileSystem NTFS -Confirm:$false
$formatted | Out-File -FilePath $errorFile -Append

# Create the data directory on each formatted volume.
$formatted | ForEach-Object {
    $driveLetter = $_.DriveLetter
    $driveLetter | Out-File -FilePath $errorFile -Append

    $dataDirectory = "$($driveLetter):\Data"
    New-Item $dataDirectory -Type directory -Force | Out-Null
}


All the output I get, though, is this:

65536

FriendlyName  CanPool OperationalStatus HealthStatus Usage       Size
PhysicalDisk2 True    OK                Healthy      Auto-Select 160 GB

FriendlyName  OperationalStatus HealthStatus IsPrimordial IsReadOnly
Data Storage… OK                Healthy      False        False


Is something wrong with how the job is executed?
Very thankful for any ideas…

Thanks so much for this!