Spinning up an ES cluster on AWS

Hi,

Does anyone have any Packer + Terraform scripts for spinning up an ES cluster on AWS?

I know there are commercial options, but as we are currently in MVP stage, we need to produce a product before getting funding.

Any help appreciated.

Thanks,

Sean.

Hi Sean,

I recently set up a cluster on AWS. Here is the approach I took to set up a three-node cluster:

Instances and storage:

  • Each node runs on its own EC2 instance (I’m using t3.small instances to start – slower network, but cheap, and I don’t have the volume yet to justify anything bigger)

  • Each node has its own EBS volume to hold ES data and logs (keep in mind that EBS volumes are AZ-specific)

  • Nodes are launched using a launch template that a) automatically finds and attaches the EBS volume based on tags b) installs Docker and pulls my event store image and c) pulls scripts that configure and run EventStore via docker. (I SSH into the machines to launch manually using the scripts so that I can hard-code the gossip seeds).

  • To enable HTTPS, the scripts also launch an NGINX HTTP proxy container that terminates HTTPS and forwards the traffic to eventstore over HTTP over the local network.

Network:

  • Each node has its own small private subnet, with each subnet running in different availability zones

  • Security groups restrict internal TCP and HTTP traffic to the three node subnets, and allow external HTTP only from my more general public and private subnets within the VPC

  • Gossip seeds hard coded to the IP addresses of the other nodes

Load balancing:

  • An Application Load Balancer distributes the external HTTP traffic between the three nodes in the cluster

  • The ALB’s target group is an IP type with the IPs of the three nodes

  • Route 53 domain fronts the ALB, so that my client applications can just post to https://eventstore-domain:exthttp-port
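Because the target group is of type IP, the three node addresses have to be registered with it explicitly. A minimal sketch of how that could be scripted, assuming hypothetical private IPs and a placeholder target group ARN (neither comes from the setup above); the helper name `alb_targets` is also made up for illustration. It only prints the `aws elbv2 register-targets` command rather than running it:

```shell
#!/usr/bin/env bash
# Print the `--targets` arguments for `aws elbv2 register-targets`
# given a port and a list of node IPs.
# Usage: alb_targets PORT NODE_IP...
alb_targets() {
  local port="$1"
  shift
  local args=() ip
  for ip in "$@"; do
    args+=("Id=$ip,Port=$port")
  done
  # Join the entries with single spaces.
  printf '%s\n' "${args[*]}"
}

# Hypothetical values: replace the ARN and IPs with your own.
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/event-store-exthttp-dev/0123456789abcdef"
TARGETS="$(alb_targets 8113 10.0.1.10 10.0.2.10 10.0.3.10)"

# Dry run: print the command instead of calling AWS.
# $TARGETS is left unquoted on purpose so each entry becomes its own argument.
echo aws elbv2 register-targets --target-group-arn "$TARGET_GROUP_ARN" --targets $TARGETS
```

Dropping the `echo` would run the registration for real, provided the caller has `elasticloadbalancing:RegisterTargets` permission.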

It took me a while to get this to where I like it, but it works pretty well and I’m comfortable that it’s secure and somewhat reliable. Gotchas I ran into:

  • I tried setting up DNS-based cluster gossip, but ALBs don’t work for the plain TCP traffic, and I was having some trouble with NLBs. Not to mention, NLBs can route gossip traffic back to the nodes that sent the requests. So I made it easier on myself by hard-coding the IPs for gossip seeds instead of using DNS. (Would be happy to hear if anyone got this working)

  • The EventStore UI is not very happy if you try to access it through a proxy. I tried following other threads here for how to configure the proxy correctly, but still found that many of my requests were dropped with 404s. What I ended up doing was spinning up small Windows instances, connecting to them over RDP, and using the browser from there to hit the EventStore UI.

  • Unfortunately, the cluster is not self-healing with the hard-coded IPs, but at least we can get health check failures out of the target group.

Again, this is what I was able to get working, but if anyone has figured out better ways to configure an OSS ES cluster, I’m also happy to hear it!

-John

Hi John,

Thanks for this.

It would be good to know how to get gossip working.

Do you have any templates you’d be happy sharing?

Thanks,

Sean.

On Behalf Of John Lazos

With hard-coded IPs and the right security group config, the gossip config actually becomes pretty easy. You just set the environment variable EVENTSTORE_GOSSIP_SEED=0.0.0.0:1113,0.0.0.0:1113, replacing the 0.0.0.0 with the IPs of the other two nodes.
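A minimal sketch of building that seed list, assuming hypothetical private IPs (one per subnet); the helper name `gossip_seed` is made up for illustration. The port is passed as a parameter: 1113 simply mirrors the value quoted above, so substitute whichever port your nodes actually gossip on.

```shell
#!/usr/bin/env bash
# Build an EVENTSTORE_GOSSIP_SEED value from a list of node IPs,
# leaving out this node's own address.
# Usage: gossip_seed SELF_IP PORT NODE_IP...
gossip_seed() {
  local self_ip="$1" port="$2"
  shift 2
  local seeds="" ip
  for ip in "$@"; do
    # Skip this node: the seed list should only name the other nodes.
    [ "$ip" = "$self_ip" ] && continue
    # Prepend a comma only when the list already has an entry.
    seeds="${seeds:+$seeds,}$ip:$port"
  done
  printf '%s\n' "$seeds"
}

# Hypothetical IPs; the first argument is the address of the node being configured.
EVENTSTORE_GOSSIP_SEED="$(gossip_seed 10.0.1.10 1113 10.0.1.10 10.0.2.10 10.0.3.10)"
export EVENTSTORE_GOSSIP_SEED
echo "$EVENTSTORE_GOSSIP_SEED"
```

Keeping one shared list of all node IPs and filtering out the local address means the same script can run unchanged on every node.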

Also, a side note – they say the Docker image is for development only. I personally don’t see why it should be a problem to use Docker for production. In fact, I encourage it, since you can do things like leverage ECR’s image scanning and have closer parity between local and prod environments.

Here are the relevant parts of a CFT (CloudFormation template) you might be able to draw inspiration from. Volume creation and instances are not part of it, but it’ll get you as far as the launch template.

  ESSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: !Sub 'event-store-sg-${Environment}' 
      GroupDescription: Rules for EventStore ports
      SecurityGroupIngress:
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, ExternalCidr]
          Description: External https
          IpProtocol: tcp
          FromPort: 8113
          ToPort: 8113
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, ExternalCidr]
          Description: SSH
          IpProtocol: tcp
          FromPort: 22
          ToPort: 22
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, SubnetCidrAz1]
          Description: External http
          IpProtocol: tcp
          FromPort: 2113
          ToPort: 2113
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, SubnetCidrAz2]
          Description: External http
          IpProtocol: tcp
          FromPort: 2113
          ToPort: 2113
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, SubnetCidrAz3]
          Description: External http
          IpProtocol: tcp
          FromPort: 2113
          ToPort: 2113
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, SubnetCidrAz1]
          Description: External tcp
          IpProtocol: tcp
          FromPort: 1113
          ToPort: 1113
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, SubnetCidrAz2]
          Description: External tcp
          IpProtocol: tcp
          FromPort: 1113
          ToPort: 1113
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, SubnetCidrAz3]
          Description: External tcp
          IpProtocol: tcp
          FromPort: 1113
          ToPort: 1113
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, SubnetCidrAz1]
          Description: Internal tcp
          IpProtocol: tcp
          FromPort: 1112
          ToPort: 1112
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, SubnetCidrAz2]
          Description: Internal tcp
          IpProtocol: tcp
          FromPort: 1112
          ToPort: 1112
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, SubnetCidrAz3]
          Description: Internal tcp
          IpProtocol: tcp
          FromPort: 1112
          ToPort: 1112
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, SubnetCidrAz1]
          Description: Internal http
          IpProtocol: tcp
          FromPort: 2112
          ToPort: 2112
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, SubnetCidrAz2]
          Description: Internal http
          IpProtocol: tcp
          FromPort: 2112
          ToPort: 2112
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, SubnetCidrAz3]
          Description: Internal http
          IpProtocol: tcp
          FromPort: 2112
          ToPort: 2112
      Tags:
        - Key: component
          Value: !Ref Component
        - Key: environment
          Value: !Ref Environment
      VpcId: !FindInMap [EnvironmentMap, !Ref Environment, VpcId]
  
  ESInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub "event-store-instance-role-${Environment}"
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
        - Effect: Allow
          Action: sts:AssumeRole
          Principal:
            Service: ec2.amazonaws.com
      Policies:
        - PolicyName: !Sub "event-store-instance-policy-${Environment}"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action:
                  - "ecr:BatchCheck*"
                  - "ecr:BatchGet*"
                  - "ecr:Describe*"
                  - "ecr:Get*"
                Resource: "*"
              - Effect: "Allow"
                Action: 
                  - "ec2:DescribeVolume*"
                Resource: "*"
              - Effect: "Allow"
                Action:
                  - "ec2:AttachVolume"
                Resource: "*"
                Condition:
                  StringEquals:
                    "ec2:ResourceTag/component": event-store

  ESIamInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: !Sub "event-store-instance-profile-${Environment}"
      Roles:
        - !Ref ESInstanceRole

  ESLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    DependsOn: 
      - ESSecurityGroup
      - ESEncryptionKey
    Properties:
      LaunchTemplateName: !Sub "event-store-launch-template-${Environment}"
      LaunchTemplateData:
        BlockDeviceMappings:
          - DeviceName: /dev/xvda
            Ebs:
              VolumeSize: 22
              VolumeType: standard
              DeleteOnTermination: true
        ImageId: !Ref ESInstanceAMI
        InstanceInitiatedShutdownBehavior: terminate
        IamInstanceProfile: 
          Name: !Ref ESIamInstanceProfile
        InstanceType: t3.small
        KeyName: !Sub "event-store-ssh-key-${Environment}"
        SecurityGroupIds:
          - !GetAtt ESSecurityGroup.GroupId
        TagSpecifications:
          - ResourceType: instance
            Tags:
              - Key: component
                Value: !Ref Component
              - Key: environment
                Value: !Ref Environment
        UserData:
          Fn::Base64: 
            Fn::Sub: 
              - |
                #!/bin/bash -x
                exec > /init.log 2>&1

                sudo yum -y update
                sudo yum -y install jq
                export AWS_AZ=$(ec2-metadata --availability-zone | cut -d' ' -f 2)
                export AWS_REGION=$(echo $AWS_AZ | sed 's/.$//')
                export AWS_INSTANCE_ID=$(ec2-metadata --instance-id | cut -d' ' -f 2)

                # Attach EventStore volume from this availability zone
                export ES_DEVICE=/dev/xvdb
                export ES_VOLUME_ID=$(aws ec2 --region $AWS_REGION describe-volumes --filters Name=availability-zone,Values=$AWS_AZ Name=tag-key,Values=event-store-volume-${Environment} | jq -r '.Volumes | .[] | .VolumeId')
                aws ec2 --region $AWS_REGION attach-volume --device=$ES_DEVICE --instance-id=$AWS_INSTANCE_ID --volume-id=$ES_VOLUME_ID
                aws ec2 --region $AWS_REGION wait volume-in-use --volume-ids $ES_VOLUME_ID

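                # Format EventStore volume if not already formatted.
                # On Nitro-based instances such as t3, the attached EBS volume shows up as an NVMe device,
                # so the check reads the NVMe device name while $ES_DEVICE is the name used at attach time.
                export ES_NVME_DEVICE=/dev/nvme1n1
                [ "$(sudo file -s $ES_NVME_DEVICE | cut -d' ' -f 2)" == "data" ] && sudo mkfs -t ext4 $ES_DEVICE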

                # Create user for event store using same uid as in docker container
                export ES_UID=105
                sudo useradd -u $ES_UID -r eventstore

                # Mount EventStore volume and grant permissions
                sudo mkdir /data
                sudo mount $ES_DEVICE /data
                sudo chown eventstore:eventstore /data
                sudo mkdir -p /data/data
                sudo mkdir -p /data/logs/$AWS_AZ
                sudo chown eventstore:eventstore /data/data
                sudo chown eventstore:eventstore /data/logs
                sudo chown eventstore:eventstore /data/logs/$AWS_AZ

                # Install Docker
                sudo yum -y install docker
                sudo service docker start
                sudo $(aws ecr get-login --no-include-email --region $AWS_REGION)
                sudo docker pull account.dkr.ecr.us-east-2.amazonaws.com/eventstore:production
                sudo docker pull account.dkr.ecr.us-east-2.amazonaws.com/eventstore-proxy:production
              - {
                ClusterSize: !FindInMap [EnvironmentMap, !Ref Environment, ScalingDesiredCapacity]
                }

  ESALBSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: !Sub 'event-store-alb-sg-${Environment}' 
      GroupDescription: Rules for EventStore load balancer ports
      SecurityGroupIngress: 
        - CidrIp: !FindInMap [EnvironmentMap, !Ref Environment, ExternalCidr]
          Description: External https
          IpProtocol: tcp
          FromPort: 8113
          ToPort: 8113
      Tags:
        - Key: component
          Value: !Ref Component
        - Key: environment
          Value: !Ref Environment
      VpcId: !FindInMap [EnvironmentMap, !Ref Environment, VpcId]

  ESLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties: 
      Name: !Sub "event-store-alb-${Environment}"
      Scheme: internal
      SecurityGroups: 
        - !GetAtt ESALBSecurityGroup.GroupId
      Subnets: !Split [ "," , !FindInMap [EnvironmentMap, !Ref Environment, Subnets] ]
      Tags: 
        - Key: component
          Value: !Ref Component
        - Key: environment
          Value: !Ref Environment
      Type: application

  ESTargetGroupExternalHttp:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: !Sub event-store-exthttp-${Environment}
      Protocol: HTTPS
      Port: 8113
      TargetType: ip
      VpcId: !FindInMap [EnvironmentMap, !Ref Environment, VpcId]
      HealthCheckEnabled: true
      HealthCheckIntervalSeconds: 30
      HealthCheckProtocol: HTTPS
      HealthCheckTimeoutSeconds: 10
      HealthyThresholdCount: 3
      Matcher:
        HttpCode: 200-399
      UnhealthyThresholdCount: 3
      Tags:
        - Key: environment
          Value: !Ref Environment
        - Key: component
          Value: !Ref Component
    DependsOn: [ "ESLoadBalancer" ]
  
  ESListenerExternalHttp:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties: 
      Certificates: 
        - CertificateArn: !Sub "arn:aws:acm:${AWS::Region}:${AWS::AccountId}:certificate/a30a9fd5-b445-4895-b4d5-607c8e30135a"
      DefaultActions: 
        - Type: forward
          TargetGroupArn: !Ref ESTargetGroupExternalHttp
      LoadBalancerArn: !Ref ESLoadBalancer
      Port: 8113
      Protocol: HTTPS
      SslPolicy: "ELBSecurityPolicy-TLS-1-2-Ext-2018-06"

Just a small addition: instead of configuring gossip this way, it might be better overall to use DNS. This can be set up fairly easily as part of the deploy, and would require fewer modifications when something changes. With most tools you might be using, adding DNS entries should be relatively straightforward.

Cheers,

Greg
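For reference, a rough sketch of what the DNS-based variant Greg describes could look like on each node. It assumes a hypothetical DNS name that resolves to the private IPs of all three nodes (for example, one record with multiple A values), and it assumes gossip runs on the internal HTTP port 2112 from the security group above. The helper name `dns_gossip_env` is made up; the `EVENTSTORE_*` variables are the environment-variable forms of the `DiscoverViaDns`, `ClusterDns`, `ClusterSize`, and `ClusterGossipPort` options.

```shell
#!/usr/bin/env bash
# Emit the settings for DNS-based discovery in the KEY=value form
# accepted by `docker run --env-file`.
# Usage: dns_gossip_env CLUSTER_DNS CLUSTER_SIZE GOSSIP_PORT
dns_gossip_env() {
  local cluster_dns="$1" cluster_size="$2" gossip_port="$3"
  cat <<EOF
EVENTSTORE_DISCOVER_VIA_DNS=true
EVENTSTORE_CLUSTER_DNS=$cluster_dns
EVENTSTORE_CLUSTER_SIZE=$cluster_size
EVENTSTORE_CLUSTER_GOSSIP_PORT=$gossip_port
EOF
}

# Hypothetical record name; 3 nodes; 2112 assumed to be the internal HTTP port.
dns_gossip_env es-cluster.internal.example.com 3 2112
```

With this approach the node configuration no longer names individual IPs, so replacing a node only requires updating the DNS record.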

Thanks Greg!

Don’t mean to hijack Sean’s thread here… DNS was what I was trying to set up first, but I couldn’t figure out how to do it behind an AWS Network Load Balancer (an NLB being required when load balancing the TCP connections). The NLB simply didn’t seem to be forwarding the traffic (though to be fair, it might have been a misconfiguration on my part).

Are there loopback issues with gossiping over NLB? My understanding is the NLB will drop packets if it routes a node’s gossip request back to itself. Not sure what impact if any that’ll have on the cluster.

Like Sean, I’d also love example AWS configs, or even an architecture diagram.

Hello,

We are currently in the process of moving our entire infrastructure to AWS and decided to go a different route since we already had EKS deployed. We have a three-node EventStore (ES) cluster running in Kubernetes (K8S). We haven’t gone into production with it yet, but it’s been working well in pre-production testing so far. The K8S install/config can be found here:

https://github.com/EventStore/EventStore.Charts/blob/master/stable/eventstore/README.md

You have to use DNS for cluster gossip in this configuration because the IP address will be different for each cluster node pod every time they restart. The management of the storage and migrating data to and from the persistent volumes is quite a bit different from a traditional server configuration so it takes a little getting used to.

The load balancing is handled natively in K8S, which is nice. This means we don’t have to worry about managing another set of servers in our environment. The cluster exists alongside our microservices in K8S and requires minimal additional networking considerations. We only need to provide external access to the admin portal, which is exposed natively by a K8S ingress, plus an optional K8S NodePort for testing against the data ports externally, and you’ll obviously need DNS for resolving this externally.

Jason

Hi Jason,

Thanks for this.

I’m interested in how you manage the storage and volumes and how you configure gossip/ingress.

Do you have any extended charts you could share?

Thanks,

Sean.

Considering the EventStore blog post "Event Store on Kubernetes", what is the recommended way of spinning up Event Store nodes on AWS? Did you guys experience any of the problems mentioned in the blog post?