Storing events in Avro format

Hi All,

I’m wondering whether it would make sense to add support for binary event types via Avro, and possibly schema validation for JSON events.

Why binary events?
Size, nothing else. In my test, Avro was up to 3 times smaller than JSON and took half the time.

Why schema validation?
I personally prefer that a data store checks the format of the data whenever possible. It would ensure that the events are clean.

I’ve already seen that there is some support for binary: events are in fact just a byte[], but there is this “isjson” boolean, which is what makes an event visible to JavaScript and REST!

Why do I think Avro would be a good fit?

  • one of the smallest binary sizes
  • has a .NET library from MS!
  • has schema validation built in
  • has the “GenericRecord” concept, which makes it possible to convert a record to a JSON object (so projections could work)

I could imagine it working the following way:

  • new properties “meta-content-type” and “content-type”
  • a new store for schemas (immutable, append-only), so you cannot redefine content types (maybe just a stream internally)
  • content-type points to the schema and includes the type and the version of the schema
  • “+avro” would mark that Avro is in use, “+json” that it is a JSON schema,
    e.g. “myevent-v2+avro” or “myevent-v3+json”
    The schema store would contain the JSON schema (http://json-schema.org/), activated by “+json” in the content type.
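
To make the naming idea concrete, here is a minimal sketch in Python of how a content type such as “myevent-v2+avro” could be split into name, version and format, together with an append-only schema store. Everything in it (the `ContentType` tuple, `parse_content_type`, `SchemaStore`, and the exact `<name>-v<N>+<format>` pattern) is hypothetical and only illustrates the proposal; it is not an existing ES API.

```python
import re
from typing import Dict, NamedTuple, Optional


class ContentType(NamedTuple):
    name: str     # event type name, e.g. "myevent"
    version: int  # schema version, e.g. 2
    fmt: str      # format suffix after "+", e.g. "avro" or "json"


# Assumed naming pattern: <name>-v<version>+<format>
_PATTERN = re.compile(r"^(?P<name>[A-Za-z][\w.]*)-v(?P<version>\d+)\+(?P<fmt>[a-z0-9]+)$")


def parse_content_type(value: str) -> ContentType:
    """Split e.g. "myevent-v2+avro" into its three parts."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"not a '<name>-v<N>+<format>' content type: {value!r}")
    return ContentType(match["name"], int(match["version"]), match["fmt"])


class SchemaStore:
    """Append-only store: a content type is registered once and never redefined."""

    def __init__(self) -> None:
        self._schemas: Dict[str, str] = {}

    def register(self, content_type: str, schema: str) -> None:
        parse_content_type(content_type)  # reject malformed names early
        if content_type in self._schemas:
            raise ValueError(f"{content_type!r} is already defined and cannot be redefined")
        self._schemas[content_type] = schema

    def lookup(self, content_type: str) -> Optional[str]:
        """Return the stored schema text, or None if nothing is registered."""
        return self._schemas.get(content_type)
```

A schema change would then mean registering “myevent-v3+avro” next to “myevent-v2+avro” rather than touching the existing entry.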

During processing of projections and REST calls, ES could look up the schema and do the following:

  1. Store event via REST
    A GenericRecord could be created and stored as binary if “+avro” is present.
    The JSON schema could be checked if content-type is set and contains “+json”.

  2. Store event via protobuf API: store the binary; nothing changes for Avro (Avro already checks the format).
    In the case of “+json”, the schema could be validated.

  3. GET event via REST or projection
    If “+avro” is in the content type, the data could be parsed using “GenericRecord” and the stored schema.
    If they don’t exist, nothing changes.

  4. GET via protobuf: nothing changes

Additionally, further formats could be added using the “+” notation.
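
The write path in steps 1 and 2 could be sketched roughly as follows. This is only an illustration in Python under several assumptions: `validate_on_write` and `check_required` are hypothetical names, the schema lookup is a plain dict, and the check covers only the “required” keyword, which is a tiny subset of JSON Schema (a real implementation would use a proper validator, and would decode “+avro” data with an Avro library).

```python
import json
from typing import Dict, List


def check_required(schema: dict, document: dict) -> List[str]:
    """Return the names listed under "required" that the document lacks."""
    return [field for field in schema.get("required", []) if field not in document]


def validate_on_write(content_type: str, data: bytes, schemas: Dict[str, str]) -> bytes:
    """Return the bytes to store, or raise ValueError if the event is rejected."""
    schema_text = schemas.get(content_type)
    if schema_text is None:
        return data  # no schema registered: nothing changes
    if content_type.endswith("+json"):
        document = json.loads(data)
        if not isinstance(document, dict):
            raise ValueError("event data must be a JSON object")
        missing = check_required(json.loads(schema_text), document)
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        return data
    # "+avro" and any other suffix: stored as-is in this sketch.
    return data
```

Events without a registered schema pass through untouched, which matches the “nothing changes” cases in the list above.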

Of course I can store events in Avro format already; the data is just not visible to projections and JavaScript REST clients :confused: Also, because “isjson” applies to both metadata and data, I’m not sure whether it is possible to have JSON metadata and binary data.

Cheers,

Tamas

"Of course I can store events in avro format, just the data is not
visible to projections and javascript rest clients :confused: Also because
"isjson" is for both metadata and data, I'm not sure whether it is
possible to have JSON meta and binary data."

Yes, they would only be available in Avro format to HTTP subscribers.

But why not use flatbuffers or msgpack? Once you add more, it will beget more.

"If "+avro" in content type, then the data could be parsed using
"GenericRecord" and the stored schema."

This is a whole can of worms. Let's talk about versioning schema over time ...

The likelihood of this happening is extremely low unless someone were
to sponsor the ongoing work etc.

> "If “+avro” in content type, then the data could be parsed using “GenericRecord” and the stored schema."
>
> This is a whole can of worms. Let’s talk about versioning schema over time …

How is this solved currently? You mentioned in one of your talks that people solve this with weak serialization. Doesn’t that get messy over time?
What do you see as the pitfall of using strict schemas for events (where a change means creating a new schema)?

> The likelihood of this happening is extremely low unless someone were to sponsor the ongoing work etc.

Would you accept pull requests for it?

> > This is a whole can of worms. Let's talk about versioning schema over time ...
>
> How is this solved currently? You mentioned in one of your talks that
> people solve this with weak serialization. Doesn't that get messy
> over time?
> What do you see as the pitfall of using strict schemas for events (where
> a change means creating a new schema)?

ES is non-opinionated on this at this point. How you handle this is up
to you. The moment you start supporting schemas there are many
decisions to make. As an example, what if you want to upgrade a schema
without a shutdown? E.g. they want to convert old things to new things.

> > The likelihood of this happening is extremely low unless someone were
> > to sponsor the ongoing work etc.
>
> Would you accept pull requests for it?

Possibly, but the cost of such a thing is not the initial
implementation; it's the ongoing support/testing, which is fairly large.
Spiking the concept of schema (say with protobufs) would take me about
a week or two, but the surface area it opens is probably hundreds of
weeks' worth of follow-up/testing.

Greg

On Sunday, 13 March 2016 at 22:25:59 UTC+1, Greg Young wrote:

> > > This is a whole can of worms. Let’s talk about versioning schema over time …
> >
> > How is this solved currently? You mentioned in one of your talks that people solve this with weak serialization. Doesn’t that get messy over time?
> > What do you see as the pitfall of using strict schemas for events (where a change means creating a new schema)?
>
> ES is non-opinionated on this at this point. How you handle this is up to you. The moment you start supporting schemas there are many decisions to make. As an example, what if you want to upgrade a schema without a shutdown? E.g. they want to convert old things to new things.

I think that is fine; ES should not care about it too much.
What I propose is to add some support for binary events, so that projections can see inside them.
There is currently no support for evolving schemas, and I think that is fine too. You can always spin up a projection that “copies” the stream and upgrades events on the fly.
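
As a rough illustration of “copy the stream and upgrade events on the fly”, here is a small Python sketch. It is not ES projection code: streams are modelled as plain iterables of (content type, payload) pairs, and the event type, the field rename and all function names are made up for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Event = Tuple[str, dict]  # (content type, decoded payload)


def upgrade_myevent_v1(payload: dict) -> Event:
    """Hypothetical v1 -> v2 change: the field "name" becomes "full_name"."""
    upgraded = {key: value for key, value in payload.items() if key != "name"}
    upgraded["full_name"] = payload.get("name", "")
    return "myevent-v2+json", upgraded


# One upgrade function per outdated content type.
UPCASTERS: Dict[str, Callable[[dict], Event]] = {
    "myevent-v1+json": upgrade_myevent_v1,
}


def copy_and_upgrade(source: Iterable[Event]) -> Iterator[Event]:
    """Yield every event of the source stream in its newest known version."""
    for content_type, payload in source:
        # Apply upgrades step by step (v1 -> v2 -> ...) until none is left.
        while content_type in UPCASTERS:
            content_type, payload = UPCASTERS[content_type](payload)
        yield content_type, payload
```

The old stream stays untouched; readers that want only current schemas consume the copy instead.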

> > > The likelihood of this happening is extremely low unless someone were to sponsor the ongoing work etc.
> >
> > Would you accept pull requests for it?
>
> Possibly, but the cost of such a thing is not the initial implementation; it's the ongoing support/testing, which is fairly large. Spiking the concept of schema (say with protobufs) would take me about a week or two, but the surface area it opens is probably hundreds of weeks' worth of follow-up/testing.

I thought of an extension point where you can add binary drivers. The community will provide the rest if needed, maybe outside of the ES repo.
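
Such an extension point could be as small as a registry of decoders keyed by the “+” suffix. The Python sketch below is only meant to show the shape of the idea; `register_driver` and `decode_for_projection` are hypothetical names, and only a JSON driver is registered because it needs nothing beyond the standard library. An Avro driver would be plugged in the same way by whoever maintains it.

```python
import json
from typing import Callable, Dict, Optional

Decoder = Callable[[bytes], dict]

# Registry of drivers, keyed by the suffix after "+" in the content type.
_DRIVERS: Dict[str, Decoder] = {}


def register_driver(suffix: str, decoder: Decoder) -> None:
    """Make events whose content type ends in "+<suffix>" readable."""
    _DRIVERS[suffix] = decoder


def decode_for_projection(content_type: str, data: bytes) -> Optional[dict]:
    """Return the decoded event, or None if it has to stay opaque binary."""
    if "+" not in content_type:
        return None
    suffix = content_type.rpartition("+")[2]
    decoder = _DRIVERS.get(suffix)
    return decoder(data) if decoder is not None else None


# JSON ships "built in"; other formats would be registered by plugins.
register_driver("json", json.loads)
```

With no driver registered for a suffix the event is simply left as bytes, so existing behaviour would not change.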

Cheers,

Tamas

inline.

On Sunday, 13 March 2016 at 22:25:59 UTC+1, Greg Young wrote:

> > > > This is a whole can of worms. Let's talk about versioning schema over
> > > > time ...
> > >
> > > How is this solved currently? You mentioned in one of your talks that
> > > people solve this with weak serialization. Doesn't that get messy
> > > over time?
> > > What do you see as the pitfall of using strict schemas for events
> > > (where a change means creating a new schema)?
> >
> > ES is non-opinionated on this at this point. How you handle this is up
> > to you. The moment you start supporting schemas there are many
> > decisions to make. As an example, what if you want to upgrade a schema
> > without a shutdown? E.g. they want to convert old things to new things.
>
> I think that is fine; ES should not care about it too much.
> What I propose is to add some support for binary events, so that projections
> can see inside them.
> There is currently no support for evolving schemas, and I think that is fine
> too. You can always spin up a projection that "copies" the stream and upgrades
> events on the fly.

Adding support for projections is pretty trivial. HTTP will be a bit
more fun, especially when embedding things. Also, things like debugging
projections will be interesting.

> > > > The likelihood of this happening is extremely low unless someone were
> > > > to sponsor the ongoing work etc.
> > >
> > > Would you accept pull requests for it?
> >
> > Possibly, but the cost of such a thing is not the initial
> > implementation; it's the ongoing support/testing, which is fairly large.
> > Spiking the concept of schema (say with protobufs) would take me about
> > a week or two, but the surface area it opens is probably hundreds of
> > weeks' worth of follow-up/testing.
>
> I thought of an extension point where you can add binary drivers. The community
> will provide the rest if needed, maybe outside of the ES repo.

Again, this is pretty easy to add. In fact, the messaging protocol
already supports it! Although we expose "isJson", those who have worked
with the wire protocol before have run into this:
https://github.com/EventStore/EventStore/blob/release-v3.6.0/src/Protos/ClientAPI/ClientMessageDtos.proto#L18

Cheers,

Greg