We plan to deploy to Azure (yes, yes, I know, there be demons - nobody ever said corporations make sane decisions). We’ll run as an HA cluster (within a 3-zone fault domain). The data disks will be geo-replicated so we can at least preserve data in the event of a data-center-wide failure. We’ll stripe the disks if we have to in order to meet performance goals.
I am wondering two things:
Is there any way to detect corruptions early/proactively? I think I am right in saying that the chaser would detect a corruption on write - but what happens if a sector of the disk goes bad afterwards? I don’t want to be in the position of only discovering errors as we scan over that part of the log during day-to-day activity. Is this a valid concern? Would regularly reintroducing a clean fourth EventStore node to the cluster be a way to achieve this? Or is the fact that writes happen across three Azure storage nodes (both locally and remotely) a strong enough model on its own?
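To make the first question concrete: the sort of thing I have in mind is a periodic "scrub" job that re-reads sealed log chunks and compares them against digests recorded when they were written, so bit rot surfaces on a schedule rather than during normal traffic. This is only a sketch of the idea - the file pattern `chunk-*` and the functions here are my own invention, not anything EventStore actually exposes:

```python
import hashlib
from pathlib import Path

READ_BLOCK = 1 << 20  # stream files in 1 MiB pieces


def file_digest(path: Path) -> str:
    """SHA-256 of a file, streamed so large chunk files need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(READ_BLOCK), b""):
            h.update(block)
    return h.hexdigest()


def build_manifest(log_dir: Path) -> dict:
    """Record a digest for every sealed chunk file (taken once, at write time)."""
    return {p.name: file_digest(p) for p in sorted(log_dir.glob("chunk-*"))}


def scrub(log_dir: Path, manifest: dict) -> list:
    """Re-read every sealed chunk and report any whose digest has drifted."""
    bad = []
    for name, expected in manifest.items():
        if file_digest(log_dir / name) != expected:
            bad.append(name)
    return bad
```

Run from cron (or similar), a clean scrub returns an empty list, and any non-empty result flags chunks to restore from a replica before the log scanner ever trips over them.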
Have striped, geo-replicated disks themselves ever proven to be the source of any issues? It seems to me that the ordering of replicated writes must be kept fairly strict across the constituent disks in the array, so I am worried about consistency post-failover. Since the disks are knitted together locally from the point of view of the server, the replication service can’t know anything about their relationship, which seems kind of flaky. Hopefully I am wrong, but I would appreciate any relevant knowledge/perspectives on this.