So for kicks, I’m working on an exponential problem as a personal project. I’m processing chunks of binary data (at most 26 bytes), which spawns more chunks of binary data to process.
Right now I’m using MSSQL with a uniqueness constraint on the binary data. It’s working fine, but it’s a bit slow for what I need. The average is actually decent for SQL (400-500 writes per second), but the problem space is too large for that speed.
I had the idea that I could use EventStore to speed up the write process, but in order to do that I would need to use streams in … uh … unconventional ways. I wanted to do a quick boundary check with you guys.
There are too many results to fit in memory, which is why I put the unique constraint on the database. I need the uniqueness constraint on the binary data because it eliminates more than half of the results (and growing) at each step. The way I thought to get a unique constraint in Event Store is to embed the data in the stream ID. I can do this pretty easily in .NET by using the System.Data.Linq.Binary class. ToString() on that gives a Base64 string that looks something like this:
example 1: vwodd9pR1Az+6k9H9LoMnQmZmDkk0t/wh3Q=
example 2: rxpq5UOFobWZA2V+
So my stream IDs would have to be like “state-rxpq5UOFobWZA2V+”. First off, is that even doable (are special characters allowed)? I don’t think I will run into a length limitation (the longest would be the prefix + 36 characters).
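A quick sanity check on the naming scheme above, sketched in Python rather than .NET purely for illustration. It assumes the `state-` prefix from the post and swaps in the URL-safe Base64 alphabet (`-` and `_` in place of `+` and `/`), which is an assumption on my part and not something the post settled on. The helper name `stream_id` is hypothetical.

```python
import base64

PREFIX = "state-"  # prefix taken from the post

def stream_id(data: bytes) -> str:
    """Build a stream name from a binary chunk using URL-safe Base64 (assumed choice)."""
    if len(data) > 26:
        raise ValueError("chunks are at most 26 bytes")
    return PREFIX + base64.urlsafe_b64encode(data).decode("ascii")

# Decode the two examples from the post, then re-encode with the URL-safe alphabet.
example_1 = base64.b64decode("vwodd9pR1Az+6k9H9LoMnQmZmDkk0t/wh3Q=")
example_2 = base64.b64decode("rxpq5UOFobWZA2V+")

id_1 = stream_id(example_1)
id_2 = stream_id(example_2)

# Worst case: a full 26-byte chunk encodes to 36 characters after the prefix.
longest = len(stream_id(bytes(26)))
```

The `=` padding survives this encoding, so it may still need stripping or percent-encoding if the names end up in URLs.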
Secondly, the sheer number of streams I would need to create is ridiculous. For instance, I looked at Neo4j but the billions of nodes limit is likely to block me eventually.
This is just a fun project for me, so no biggie. But I thought I would ask about known limitations before trying it.
Base64-encode the ID part of the stream name, ideally with the URL-safe alphabet (standard Base64 can contain +, / and =). Apart from anything else, you’ll hate life if you try to access URIs with special characters over HTTP! I’m not sure how you’re calculating a unique constraint here. Some hash of the values? If so, you can rely on Event Store to give you idempotent writes if you use the correct expected version numbers.
New streams do not have a cost unless you set ACLs etc. for them. There is, however, a limit of 2^31 events per stream. This does not apply to the $all stream, which is implemented differently underneath.
In MSSQL, I have a UNIQUE constraint on the VARBINARY(26) field itself. MSSQL backs a unique constraint with a unique index, so I believe lookups are O(log n) rather than O(1), but I don’t recall the specific implementation. I generate new binary data which might not be unique, so I check it inside a stored procedure (IF NOT EXISTS …) to avoid a second round trip, and return true/false to indicate whether it was inserted. I’m serializing writes through an agent to avoid concurrency issues.
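The check-and-insert in one round trip can be sketched roughly like this. To keep it runnable, the sketch uses SQLite from Python's standard library as a stand-in for MSSQL, and `INSERT OR IGNORE` as a stand-in for the `IF NOT EXISTS …` stored procedure, whose body is not shown in the post. The table name `state` and the function `try_insert` are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the MSSQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state (data BLOB NOT NULL UNIQUE)")

def try_insert(data: bytes) -> bool:
    """Insert the chunk if it is new; return True only if a row was written."""
    cur = conn.execute("INSERT OR IGNORE INTO state (data) VALUES (?)", (data,))
    conn.commit()
    # rowcount is 1 for a fresh insert and 0 when the UNIQUE constraint ignored it.
    return cur.rowcount == 1

first = try_insert(b"\x01\x02\x03")   # new value
second = try_insert(b"\x01\x02\x03")  # duplicate of the value above
row_count = conn.execute("SELECT COUNT(*) FROM state").fetchone()[0]
```

Either way the caller gets a single true/false per write, with the uniqueness decision made by the database rather than by the application.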
I can’t keep any sort of hashset in memory to check uniqueness because I will easily run into the 2GB per object limit. Even if I use the large object heap, I will eventually just run out of memory. (Been there already without it, filled up 12GB between OS, DB, and this program.)
I don’t expect many events per stream. In fact, if there was just a way to attempt a write only if the stream does not exist, I would use that. (Not a use case anybody else needs, I’m sure!) I don’t want to slow things down by catching an exception on expected version mismatch or by pre-reading the stream. Of course, I should measure the round-trip cost of reading the stream before saving (basically just checking if it exists). I could blindly append an empty event to the stream, ignoring the expected version. The existence of the stream is all I strictly need. It would be nice, though, to record in the event the ID of the stream that caused this stream to be created, for tracing.
So the remaining question is what is the limit on number of streams? Documentation mentions millions are within design targets. My use case would stretch that a bit. In the current db, I am already close to 100 million unique values.
> I can’t keep any sort of hashset in memory to check uniqueness because I will easily run into the 2GB per object limit. Even if I use the large object heap, I will eventually just run out of memory. (Been there already without it, filled up 12GB between OS, DB, and this program.)
Unmanaged memory? Could also optimize this somewhat with a bloom filter or something.
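For reference, a minimal Bloom filter sketch in Python (not .NET, and not tied to any particular library). The class and method names are hypothetical, and the double-hashing scheme over SHA-256 is just one reasonable choice. The point is that a "no" answer is definite, so only "maybe" answers need the authoritative check against the store.

```python
import hashlib
import math

class BloomFilter:
    """Bit-array membership filter: no false negatives, tunable false positives."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.
        self.num_bits = math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely never added"; True means "possibly added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )

# Small demo sizing; the real problem would use a far larger expected_items.
seen = BloomFilter(expected_items=10_000, false_positive_rate=0.01)
seen.add(b"\xaf\x1a\x6a\xe5")
```

By the sizing formula, a 1% false-positive rate costs about 9.6 bits per item, so roughly 120 MB for 100 million values, compared with up to 2.6 GB for the raw 26-byte values alone.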
> I don’t expect many events per stream. In fact, if there was just a way to attempt a write only if the stream does not exist, I would use that. (Not a use case anybody else needs, I’m sure!)
You can do that - ExpectedVersion 0.
> It would be nice, though, to record in the event the ID of the stream that caused this stream to be created, for tracing.
You can put that in the metadata of the event.
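To make the two suggestions concrete, here is a toy in-memory model in Python. It does not talk to Event Store and does not use its client API. `ToyEventStore`, `try_create`, and the `NO_STREAM` sentinel are all hypothetical, and the sentinel's value is arbitrary (the reply above cites 0 for Event Store itself). It only illustrates the shape of "append only if the stream is absent, and carry the parent ID in the metadata".

```python
from dataclasses import dataclass, field

NO_STREAM = -1  # toy sentinel meaning "the stream must not exist yet" (hypothetical)

@dataclass
class ToyEventStore:
    """In-memory stand-in that models expected-version checks on append."""
    streams: dict = field(default_factory=dict)

    def append(self, stream_id: str, expected_version: int,
               data: bytes, metadata: dict) -> bool:
        events = self.streams.get(stream_id, [])
        current_version = len(events) - 1  # -1 when the stream has no events
        if current_version != expected_version:
            return False  # a real client would signal a version conflict here
        self.streams.setdefault(stream_id, []).append((data, metadata))
        return True

    def try_create(self, stream_id: str, data: bytes, metadata: dict) -> bool:
        """Write the first event only if the stream is absent."""
        return self.append(stream_id, NO_STREAM, data, metadata)

store = ToyEventStore()
# First sighting of this state: the stream is created, with the parent in metadata.
created = store.try_create("state-child", b"", {"parent": "state-parentA"})
# Same state reached from a different parent: rejected as a duplicate.
duplicate = store.try_create("state-child", b"", {"parent": "state-parentB"})
```

In this model the duplicate is reported as a plain `False`. Whether a real client reports it as cheaply is exactly the kind of thing worth measuring.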
> So the remaining question is what is the limit on number of streams? Documentation mentions millions are within design targets. My use case would stretch that a bit. In the current db, I am already close to 100 million unique values.
The subsequent event will have different event data (if I store the parent ID, because it was generated by a different parent), so I believe it will throw WrongExpectedVersionException instead of doing an idempotent write. In that case, more than half of the writes will result in an exception, which will slow things down a bit. I’ll probably have to read the stream first to check for existence. In SQL land, I made a stored procedure to check existence before inserting, so I didn’t have to round-trip any read results before the insertion.
In any case, I’m definitely going to give it a try. Thanks!