EventStore in a worker role, Azure, blob

Yeah, simple one :slight_smile:
Ok, so I tried to abuse the system, no graceful shutdown needed. Three nodes up, I put the computer to sleep. When I start up again I have:
[07272,06,07:19:47.892] Error while collecting stats
The Counter layout for the Category specified is invalid, a counter of the type: AverageCount64, AverageTimer32, CounterMultiTimer, CounterMultiTimerInverse, CounterMultiTimer100Ns, CounterMultiTimer100NsInverse, RawFraction, or SampleFraction has to be immediately followed by any of the base counter types: AverageBase, CounterMultiBase, RawBase or SampleBase.

and they can’t get back to work. Closed them down, started up again, and they’re stuck with this:
[04812,15,07:24:44.530] Error during setting content length on HTTP response: This operation cannot be performed after the response has been submitted…

"
[07272,06,07:19:47.892] Error while collecting stats
The Counter layout for the Category specified is invalid, a counter of the type:
  AverageCount64, AverageTimer32, CounterMultiTimer, CounterMultiTimerInverse, C
ounterMultiTimer100Ns, CounterMultiTimer100NsInverse, RawFraction, or SampleFrac
tion has to be immediately followed by any of the base counter types: AverageBas
e, CounterMultiBase, RawBase or SampleBase."

This is a Windows thing, especially on going to sleep and coming back.

lodctr /R will probably fix it (at least it usually has)

Thanks, that solved it.

Ok, so update on this.

It works. :slight_smile:

I now have 3 instances of a worker role running. They have ES installed and are running as a cluster, each with its own folder of db and log files on Azure Files. The folder to use is decided by the instance id, so the instances will never compete for a folder.

It has been a pain really, mostly because of lacking skillz.
Getting simple things done, like running a PowerShell script or adding credentials to the VM, just kept being a hassle.
I ended up with a real hack (for now): setting the VM credentials to those of my storage account, because the mapped drive won’t be accessed without the credentials. The cmdkey add executes, but somehow the credentials are not on the machine when I remote into it to check…
Not being able to test this without publishing is a real pain in the ¤&&/.

Alright. Having got that done, some notes on Azure Files.
I had transfer speeds of ~600 kB/s.
With 3 dbs on it, I think that leaves maybe 200 kB/s of capacity for events+commands storage (if commands are stored too). I don’t know how many concurrent users that would support. The response has to be sent back as well, so maybe it’s just 100 kB/s worth of events+commands.
I hope that number gets better with bigger worker roles (running extra small ones now).

So, I will have a durable subscription to get a backup at another location.

I’ll continue with load balancing calls to the instances, and then I’m gonna learn how to use ES, I don’t know squat about it yet :smiley:

I think you’re right in that the choice of role size is the limiting factor, as XS instances are limited to 5 Mbps of bandwidth.
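As a sanity check, that cap lines up with the transfer speeds observed earlier. A back-of-envelope sketch (assuming "5 Mbps" means 5,000,000 bits per second):

```python
# XS instance NIC cap, converted from bits to bytes per second.
mbps = 5
bytes_per_sec = mbps * 1_000_000 // 8

print(bytes_per_sec)           # 625000 bytes/s
print(bytes_per_sec // 1024)   # ~610 KiB/s, close to the ~600 kB/s observed
```

So the ~600 kB/s seen on Azure Files is roughly the whole NIC, not a storage limit.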

nmehlei, I’ll try with some really big ones soon and see what difference it makes :slight_smile:

Ok.
So, how can this data be interpreted? What’s the health of the system?

Node 1:

[PID:00772:005 2015.03.06 16:22:30.599 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 142ms. Q: 0/0.
[PID:00772:006 2015.03.06 16:40:30.766 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 140ms. Q: 0/0.
[PID:00772:007 2015.03.06 16:45:58.879 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: DataChunkBulk - 833ms. Q: 0/0.
[PID:00772:016 2015.03.06 16:47:49.791 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 161ms. Q: 0/0.
[PID:00772:019 2015.03.06 17:00:12.725 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 109ms. Q: 0/0.
[PID:00772:006 2015.03.06 17:02:11.747 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 129ms. Q: 0/0.
[PID:00772:016 2015.03.06 17:02:23.721 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 113ms. Q: 0/0.
[PID:00772:012 2015.03.06 17:02:23.749 TRACE InMemoryBus ] SLOW BUS MSG [manager input bus]: RegularTimeout - 77ms. Handler: ProjectionCoreCoordinator.
[PID:00772:012 2015.03.06 17:02:23.749 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projections Master]: RegularTimeout - 77ms. Q: 0/1.
[PID:00772:016 2015.03.06 17:05:33.714 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 109ms. Q: 0/0.
[PID:00772:019 2015.03.06 17:09:28.718 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 113ms. Q: 0/0.
[PID:00772:023 2015.03.06 17:12:19.802 DEBUG HttpEntityManager ] Close connection error (after crash in read request): The parameter is incorrect
[PID:00772:023 2015.03.06 17:12:19.802 DEBUG GossipController ] Error while reading request (gossip): The I/O operation has been aborted because of either a thread exit or an application request
[PID:00772:019 2015.03.06 17:12:30.466 DEBUG HttpEntityManager ] Error during setting content length on HTTP response: This operation cannot be performed after the response has been submitted…
[PID:00772:006 2015.03.06 17:14:59.720 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 109ms. Q: 0/0.
[PID:00772:017 2015.03.06 17:20:19.787 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 171ms. Q: 0/0.
[PID:00772:016 2015.03.06 17:20:41.743 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 109ms. Q: 0/0.
[PID:00772:020 2015.03.06 17:24:36.761 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 171ms. Q: 0/0.
[PID:00772:016 2015.03.06 17:27:04.763 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 156ms. Q: 0/0.

Node 2:

[PID:02604:007 2015.03.06 17:23:34.365 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: DataChunkBulk - 1937ms. Q: 0/0.

Node 3:

(This is a bit older data, from startup)
[PID:01104:010 2015.03.06 16:16:30.436 ERROR UserManagementServic] ‘admin’ user account could not be created
[PID:01104:011 2015.03.06 16:16:30.748 FATAL ProjectionManager ] Cannot initialize projections subsystem. Cannot write a fake projection
[PID:01104:011 2015.03.06 16:16:32.001 ERROR ProjectionManager ] The ‘$by_category’ projection faulted due to ‘Unexpected ‘PREPARED’ message in PreparedState’
[PID:01104:011 2015.03.06 16:16:32.001 ERROR ProjectionManager ] The ‘$streams’ projection faulted due to ‘Unexpected ‘PREPARED’ message in PreparedState’
[PID:01104:011 2015.03.06 16:16:32.001 ERROR ProjectionManager ] The ‘$users’ projection faulted due to ‘Unexpected ‘PREPARED’ message in PreparedState’
[PID:01104:011 2015.03.06 16:16:32.001 ERROR ProjectionManager ] The ‘$stream_by_category’ projection faulted due to ‘Unexpected ‘PREPARED’ message in PreparedState’
[PID:01104:011 2015.03.06 16:16:32.001 ERROR ProjectionManager ] The ‘$by_event_type’ projection faulted due to ‘Unexpected ‘PREPARED’ message in PreparedState’

It managed to do all these things shortly after.

Currently in main log:
[PID:01104:010 2015.03.06 16:20:24.880 TRACE InMemoryBus ] SLOW BUS MSG [MainBus]: GossipReceived - 49ms. Handler: NodeGossipService.
[PID:01104:010 2015.03.06 16:20:24.880 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [MainQueue]: GossipReceived - 49ms. Q: 0/1.
[PID:01104:006 2015.03.06 16:22:44.988 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue #1]: ReadAllEventsBackward - 263ms. Q: 0/0.
[PID:01104:005 2015.03.06 16:22:54.243 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue #3]: ReadAllEventsBackward - 220ms. Q: 0/0.
[PID:01104:010 2015.03.06 16:36:33.976 TRACE InMemoryBus ] SLOW BUS MSG [MainBus]: GossipReceived - 128ms. Handler: NodeGossipService.
[PID:01104:010 2015.03.06 16:36:33.976 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [MainQueue]: GossipReceived - 128ms. Q: 0/1.
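If you end up grepping a lot of these, the SLOW QUEUE/BUS lines are regular enough to parse. A throwaway sketch (the regex only targets the line shape shown above; `parse_slow` is a made-up helper, not part of ES):

```python
import re

# Matches e.g. "SLOW QUEUE MSG [StorageWriterQueue]: DataChunkBulk - 1937ms"
SLOW = re.compile(r"SLOW (?:QUEUE|BUS) MSG \[(?P<queue>[^\]]+)\]: (?P<msg>\w+) - (?P<ms>\d+)ms")

def parse_slow(line):
    """Return (queue, message type, latency in ms), or None for other lines."""
    m = SLOW.search(line)
    return (m["queue"].strip(), m["msg"], int(m["ms"])) if m else None

line = "[PID:02604:007 2015.03.06 17:23:34.365 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: DataChunkBulk - 1937ms. Q: 0/0."
print(parse_slow(line))  # ('StorageWriterQueue', 'DataChunkBulk', 1937)
```

Sorting the tuples by latency makes the rare slow writes (DataChunkBulk) stand out from the harmless GetFreshStats noise.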

And, how can I know which db is the master if all nodes go down, when they are all named chunk-000000.000000? This X and Y thing doesn’t seem to work here?

So, if I want to restart everything, or they crash, can I trust that they will just pick up the db’s and continue like nothing happened?

When I do a swap from stage to production, should I turn off the nodes first somehow?

Go to any node/gossip and it will give you status

The freshstats being slow is no issue.

The only worrying one is the projections errors/user error from previously

Go to any node/gossip and it will give you status

Gossip eavesdropping worked until one node was down. So I’m just thinking: if they are all down, how could I determine, just from the files, which one is The Db?
I’d probably have the logs to query for who was master and rely on that, but I read somewhere of some X and Y pattern on the chunk file, not present on my dbs though.

The freshstats being slow is no issue.

Great!

The only worrying one is the projections errors/user error from previously

Even though it managed to perform all those things shortly after?

Go to any node/gossip and it will give you status

"Gossip eavesdropping worked until one was down. So, I’m just thinking if they are down, how could I determine just from the files which one is The Db?
I’d probably have the logs to query for who was master and rely on that, but I read somewhere of some X and Y pattern on the chunk-file, not present on my dbs though."

Once you have gossip just start picking a random node to ask (you have all the nodes listed). You should never be looking at files.

Maybe it was an error, but I couldn’t pick them when not all instances were alive. That’s why I thought that, in a worst case scenario where they just don’t connect again, I would have a way to determine it from the files.
I’ll set up a ClusterMonitor class that gets the gossip json at intervals and deserializes the members to custom type ClusterInstance.
Anyway. I’ve restarted the role instances a couple of times, and the cluster reestablishes when they are back up again. So it works for that part too.
So, I never got the things to run well with powershell as a startup task, so it’s all being installed from OnStart() instead.
If anyone wants to try, this is how I did it:
ES Cluster on multi instance Worker Role
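For what it’s worth, the gossip-polling part of that monitor can be sketched roughly like this. Assumptions: a node’s HTTP endpoint on port 2113, and a `/gossip` payload carrying a `members` array with `state`, `isAlive`, `externalHttpIp` and `externalHttpPort` fields (the shape ES 3.x returns); `find_master` is a made-up helper name:

```python
import json
from urllib.request import urlopen

def find_master(gossip_json: str):
    """Return (ip, port) of the member reported as Master, or None."""
    members = json.loads(gossip_json).get("members", [])
    for m in members:
        if m.get("state") == "Master" and m.get("isAlive"):
            return m.get("externalHttpIp"), m.get("externalHttpPort")
    return None

def poll(node_url: str):
    """Ask one live node, e.g. poll('http://10.0.0.4:2113/gossip')."""
    with urlopen(node_url, timeout=5) as resp:
        return find_master(resp.read().decode("utf-8"))

# Offline example with a trimmed payload:
sample = ('{"members":['
          '{"state":"Slave","isAlive":true,"externalHttpIp":"10.0.0.5","externalHttpPort":2113},'
          '{"state":"Master","isAlive":true,"externalHttpIp":"10.0.0.4","externalHttpPort":2113}]}')
print(find_master(sample))  # ('10.0.0.4', 2113)
```

A real ClusterMonitor would loop over all known instance endpoints and take the first answer, so one dead node doesn’t blind it.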

"Maybe it was an error, but I couldn't pick them when not all
instances where alive. That's why I thought that, in a worst case
scenario where they just don't connect again, I would have a way to
determine it from the files."

Define "pick them"

You should never ever be looking at files.

Pick them: http://ip:port/gossip
I’m talking about manually looking at a file, and only if I can’t get ES to run and determine for me which of the dbs was the master. As long as ES will run and process data, there’s no need for it. It just occurred to me that if I somehow get an unhealthy cluster that refuses to come alive again, I would like to know which of the dbs to save.

So this is the first code that worked. Nothing refined, nothing improved (also no error handling, and some basic stuff removed).

    // Needs Microsoft.WindowsAzure.Storage(.Blob/.File), System.IO and System.IO.Compression.
    public override bool OnStart()
    {
        // Local scratch disk, configured as a LocalStorage resource in the service definition.
        _slsPath = RoleEnvironment.GetLocalResource("StartupLocalStorage").RootPath;

        // Retrieve the storage account from the connection string in the role configuration.
        var settingstring = CloudConfigurationManager.GetSetting("StorageConnectionString");
        CloudStorageAccount storageAccount = CloudStorageAccount.Parse(settingstring);

        var blobClient = storageAccount.CreateCloudBlobClient();
        CloudBlobContainer container = blobClient.GetContainerReference("zipfiles");
        ICloudBlob blob = container.GetBlobReferenceFromServer("EventStore.zip");

        // Copy the blob from cloud storage to local disk, then unzip it there.
        var zipPath = Path.Combine(_slsPath, "EventStore.zip");
        blob.DownloadToFile(zipPath, FileMode.Create);
        ZipFile.ExtractToDirectory(zipPath, _slsPath);

        // Create the Azure Files share that holds each instance's db and log folders.
        var share = storageAccount
            .CreateCloudFileClient()
            .GetShareReference("nameofshare");
        share.CreateIfNotExists();

        return base.OnStart();
    }

Pick them: http://ip:port/gossip

Isn’t it more likely that wherever that code is running is partitioned from the nodes?

"I'm talking about manually looking at a file - only if I can't get ES to run and determine for me which of the dbs was the master. As long as ES will run and process data, there's no need for it. Just occurred to me that if I somehow get an unhealthy cluster that refuses to come alive again, I would like to know which of the dbs to save."

You should never ever look at the files. There is no valid use case for you to be doing this.

Alright, so I try to find a node that’s still up, and deduce who was master when things started to go wrong. Ok, no use case for that. Maybe there will always be other ways to find the master db even when files start to go corrupt and so on (not an entirely alien scenario); I guess it will always be deducible from the logs.

Anyway, the projections are all in a Preparing state, even though it has been running for more than 3 days, and restarted a couple of times.
After restart there is this:

[PID:01104:011 2015.03.08 22:46:18.593 ERROR ProjectionManager ] The ‘$by_category’ projection faulted due to ‘Unexpected ‘PREPARED’ message in PreparedState’
[PID:01104:011 2015.03.08 22:46:18.845 ERROR ProjectionManager ] The ‘$streams’ projection faulted due to ‘Unexpected ‘PREPARED’ message in PreparedState’
[PID:01104:011 2015.03.08 22:46:19.126 ERROR ProjectionManager ] The ‘$users’ projection faulted due to ‘Unexpected ‘PREPARED’ message in PreparedState’
[PID:01104:011 2015.03.08 22:46:19.227 ERROR ProjectionManager ] The ‘$stream_by_category’ projection faulted due to ‘Unexpected ‘PREPARED’ message in PreparedState’
[PID:01104:011 2015.03.08 22:46:19.227 ERROR ProjectionManager ] The ‘$by_event_type’ projection faulted due to ‘Unexpected ‘PREPARED’ message in PreparedState’

Is reinstalling the only option left? Also deleting the dbs?

Have you tried resetting them from the UI/Restful api?
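For reference, the reset can also be issued over HTTP instead of the UI. A hedged sketch (assumes the ES 3.x `/projection/<name>/command/reset` route and the default admin credentials; point it at a live node):

```python
import base64
from urllib.request import Request, urlopen

def reset_command_url(base: str, projection: str) -> str:
    """Build the reset-command URL for a projection (ES 3.x route, assumed)."""
    return f"{base}/projection/{projection}/command/reset"

def reset_projection(base, projection, user="admin", password="changeit"):
    """POST the reset command with basic auth; returns the HTTP status code."""
    req = Request(reset_command_url(base, projection), data=b"", method="POST")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urlopen(req, timeout=10) as resp:
        return resp.status

print(reset_command_url("http://127.0.0.1:2113", "$by_category"))
# http://127.0.0.1:2113/projection/$by_category/command/reset
```

The UI buttons call these same command endpoints, so scripting it against each system projection saves clicking through them one by one.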

"Allright, so I pick to find a node still being up, and deduct who was
master when things started go wrong. Ok, no use case for that. Maybe
there will always be other ways to find the masterdb even when files
starts to go corrupt and so on (not an entirely alien scenario), I
guess it will always be deductible from the logs."

If none of your nodes are up I think you might realize that. My guess is if all your nodes are down you probably can't automate handling that anyways.

You mean disable and enable the projections?

I tried that with the master, only got errors (a bunch of both of these, tried a couple of times :slight_smile: ):

[PID:01104:011 2015.03.09 12:31:25.797 ERROR QueuedHandlerMRES ] Error while processing message EventStore.Projections.Core.Messages.ProjectionManagementMessage+Command+Disable in queued handler ‘Projections Master’.
System.NotSupportedException: Specified method is not supported.
at EventStore.Projections.Core.Services.Management.ManagedProjection.StopUnlessPreparedOrLoaded() in c:\EventStore\src\EventStore.Projections.Core\Services\Management\ManagedProjection.cs:line 837
at EventStore.Projections.Core.Services.Management.ManagedProjection.Handle(Disable message) in c:\EventStore\src\EventStore.Projections.Core\Services\Management\ManagedProjection.cs:line 401
at EventStore.Projections.Core.Services.Management.ProjectionManager.Handle(Disable message) in c:\EventStore\src\EventStore.Projections.Core\Services\Management\ProjectionManager.cs:line 269
at EventStore.Core.Bus.MessageHandler`1.TryHandle(Message message) in c:\EventStore\src\EventStore.Core\Bus\MessageHandler.cs:line 33
at EventStore.Core.Bus.InMemoryBus.Publish(Message message) in c:\EventStore\src\EventStore.Core\Bus\InMemoryBus.cs:line 324
at EventStore.Core.Bus.QueuedHandlerMRES.ReadFromQueue(Object o) in c:\EventStore\src\EventStore.Core\Bus\QueuedHandlerMRES.cs:line 121

[PID:01104:011 2015.03.09 12:31:34.565 ERROR QueuedHandlerMRES ] Error while processing message EventStore.Projections.Core.Messages.ProjectionManagementMessage+Command+Enable in queued handler ‘Projections Master’.
System.NotSupportedException: Specified method is not supported.
at EventStore.Projections.Core.Services.Management.ManagedProjection.StopUnlessPreparedOrLoaded() in c:\EventStore\src\EventStore.Projections.Core\Services\Management\ManagedProjection.cs:line 837
at EventStore.Projections.Core.Services.Management.ManagedProjection.Handle(Enable message) in c:\EventStore\src\EventStore.Projections.Core\Services\Management\ManagedProjection.cs:line 432
at EventStore.Projections.Core.Services.Management.ProjectionManager.Handle(Enable message) in c:\EventStore\src\EventStore.Projections.Core\Services\Management\ProjectionManager.cs:line 285
at EventStore.Core.Bus.MessageHandler`1.TryHandle(Message message) in c:\EventStore\src\EventStore.Core\Bus\MessageHandler.cs:line 33
at EventStore.Core.Bus.InMemoryBus.Publish(Message message) in c:\EventStore\src\EventStore.Core\Bus\InMemoryBus.cs:line 324
at EventStore.Core.Bus.QueuedHandlerMRES.ReadFromQueue(Object o) in c:\EventStore\src\EventStore.Core\Bus\QueuedHandlerMRES.cs:line 121

And then I shut the server down from the UI, but there’s no restart button?

"Have you tried resetting them from the UI/Restful api?"

When I click on a projection in the UI there are buttons for operations to be done on them. Start/stop/etc. Try "stop" then "reset" (they are buttons in the upper right hand corner).

Greg

They cannot be stopped.

then click reset

I get “reset failed”, on all the projections.

I wonder what could have gone wrong here. Maybe it’s best to just delete it all, redeploy, and see if it comes up again?