Hi Sean
We are using m4.large instances, but the right size will depend on your load.
We are also putting nodes across availability zones for redundancy.
To deal with a failed node we make use of a few things. We use DNS (Route 53) to allow ES to find the nodes and the cluster. We also have the process completely scripted, so that when we do lose a node the scripts kick in to recreate the node and add it to the cluster.
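As a rough illustration of the DNS step only (this is not the actual scripts; the record name and IP address are made up), a replacement node's record could be re-pointed with a Route 53 UPSERT. The sketch below just builds the change batch as a plain dict; `build_upsert_batch` is a hypothetical helper name.

```python
def build_upsert_batch(record_name, ip_address, ttl=60):
    """Build a Route 53 ChangeBatch that points record_name at ip_address."""
    return {
        "Comment": "Re-point ES node record after node replacement",
        "Changes": [
            {
                # UPSERT creates the record if missing, otherwise updates it.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": ip_address}],
                },
            }
        ],
    }


# Hypothetical values, for illustration only.
batch = build_upsert_batch("es-node-1.example.internal.", "10.0.1.25")

# With boto3 installed and AWS credentials configured, the batch could be
# applied like this (the hosted zone ID is a placeholder):
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your-zone-id>", ChangeBatch=batch)
```

A short TTL (60 seconds is an assumption here) keeps the other nodes from holding on to the dead node's address for long.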
We do rely on ES to seed the new node so we do not do any restore to the new node.
We have tested this and it works well. Even on a slow AWS day, we can get the cluster back up with all the nodes in about 15 minutes from the cluster port going down.
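One way to check that a rebuilt cluster really is back with all its nodes is the ES cluster health API, which can block until a given status and node count are reached. This is only a sketch: the host name and node count are assumptions, `health_wait_url` is a hypothetical helper, and the 900 second timeout simply mirrors the 15 minutes mentioned above.

```python
from urllib.parse import urlencode


def health_wait_url(host, expected_nodes, timeout_s=900, port=9200):
    """Build a _cluster/health URL that waits for green status and N nodes."""
    params = urlencode({
        "wait_for_status": "green",
        "wait_for_nodes": expected_nodes,
        "timeout": f"{timeout_s}s",
    })
    return f"http://{host}:{port}/_cluster/health?{params}"


# Hypothetical host and node count, for illustration only.
url = health_wait_url("es.example.internal", 3)
print(url)
```

A GET on that URL returns once the conditions are met or the timeout expires; the `timed_out` field in the response tells you which one happened.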
We are now looking to make the setup resilient at the region level by running a secondary cluster, but we are still working through the details of how to make sure the master doesn't move the secondary cluster's nodes.
Hope this helps.
Chris