daddykotex's comments | Hacker News

I've had this question for a while, so I'll ask it here; maybe someone can help me.

Suppose you modeled your domain with events and your stack is built on top of it. As stuff happens in your application, events are generated and appended to the stream. The stream is consumed by any number of consumers and awesome stuff is produced with it. The stream is persisted and you have all events starting from day 1.

Over time, things have changed and you have evolved your events to include some fields and deprecate others. You could do this without any downtime whatsoever by changing your events in a backward-compatible way.

What is a good approach to what I'd call a `replay`?

When you want to replay all events, the version of your apps that will consume the events may not know about the fields that were in the events from day one.


As always in these types of scenarios, the answer is: it depends. It depends on the amount of data you have. It depends on how big the divergence from the original schema is. Etcetera.

My personal philosophy is to always leave event data at rest alone: data is immutable, you don't convert it, and you treat it like a historical artifact. You version each event, but never convert it into a new version in the actual event store. Any version upgrades that should be applied are done when the event is read; this requires automated procedures to convert any event version N to another version N + 1, but having these kinds of procedures in place is good practice anyway. Some might argue that doing this every time an event is read is a waste of CPU cycles, but in my experience that cost is far outweighed by the benefit of keeping the actual event as it was stored at that point in the past, and this type of data is accessed far less frequently than new event data.
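
A minimal sketch of what that read-time upgrade might look like in Scala (the event shapes and the `Upcasters` object are invented for illustration):

    // Hypothetical event versions: V1 stored a single "name",
    // V2 split it into first/last name.
    sealed trait Event
    case class UserCreatedV1(name: String) extends Event
    case class UserCreatedV2(firstName: String, lastName: String) extends Event

    object Upcasters {
      // One small, pure function per version step: N => N + 1.
      def v1ToV2(e: UserCreatedV1): UserCreatedV2 = {
        val parts = e.name.split(" ", 2)
        UserCreatedV2(parts(0), if (parts.length > 1) parts(1) else "")
      }

      // Applied at read time; the stored V1 event is never rewritten.
      def toLatest(e: Event): UserCreatedV2 = e match {
        case v1: UserCreatedV1 => v1ToV2(v1)
        case v2: UserCreatedV2 => v2
      }
    }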


I suppose you can always trade those CPU cycles off against storage and cache the N+1 version (in a separate Kafka topic or elsewhere), so reading the latest-version data is fast, yet you still retain the original data intact, at the expense of more storage. This does complicate the storage though, as you now have multiple data sources, but nothing that can't be solved.


Bingo. If you equate this technique to a DB migration, you could have "up" and "down" directions for the translations from version N <=> N+1.

Then if you have 90% confidence you'll only ever need to replay the upgraded stream, you can upgrade it and destroy the previous version.

If at some point (the remaining 10%) you need to rescue the old stream, you can run the "down" direction and rehydrate the old version of the stream.
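
A rough sketch of what that up/down pair could look like, with made-up V1/V2 event shapes (not tied to any particular store):

    case class OrderPlacedV1(customer: String)
    case class OrderPlacedV2(customerId: String, channel: String)

    // Symmetric "up" and "down", like a DB migration pair.
    object OrderPlacedMigration {
      def up(e: OrderPlacedV1): OrderPlacedV2 =
        OrderPlacedV2(customerId = e.customer, channel = "unknown")

      // Only works because "up" kept every original field;
      // a lossy "up" cannot be reversed.
      def down(e: OrderPlacedV2): OrderPlacedV1 =
        OrderPlacedV1(customer = e.customerId)
    }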


It sounds good in theory. In practice, I haven't heard much about running backward migrations on a data warehouse / massive collection of events, but I'm sure some out there already do it.


I suppose one needs to take care that migrations are never lossy so that the full information for upgrading or downgrading a version is available.


Yeah, that's the challenge. For instance, how do you handle a column that was one data type but was later changed to another type, when the two aren't cross-compatible or the conversion could potentially break?


You could retain this info in a meta field of flexible type. For a DB, it could be a JSON type. For messages, it could be an extra _meta field on the message that the systems themselves ignore.
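
For example (a hypothetical shape, not any particular system's format), the original raw value could ride along untouched in a `_meta` map that consuming systems simply ignore:

    // The typed field uses the new representation; _meta carries whatever
    // raw/legacy value would be needed to reverse the migration later.
    case class PriceChanged(
      amountCents: Long,
      _meta: Map[String, String] = Map.empty
    )

    val migrated = PriceChanged(
      amountCents = 1999L,
      _meta = Map("legacy_amount" -> "19.99", "legacy_type" -> "decimal")
    )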


> this requires automated procedures to convert any event version N to another version N + 1, but having these kinds of procedures in place is good practice anyway

Note that this "good practice" already has a name: it is usually called "migrations".

Migrations may be as simple as SQL UPDATE / ALTER TABLE statements. But they may also be transformations of JSON/XML/... structures, or any other complex calculation.

It may not be the best term, as "migrations" usually implies that the result is written back to the data store. But apart from that, I don't see any problem using this term here.


Ok this makes sense.

It matches what some others have been saying as well. Thanks


There's a whole world out there about this kind of stuff. Take a look at CQRS and some of the posts by Greg Young; they're highly informative, and he was one of the first people to really capture this way of dealing with data properly.


I wonder what happened to Greg Young. I think he had a book in the pipeline (which I was looking forward to), but as far as I can glean from social media, some burnout related stuff happened.


Here's a pretty good, even if a bit too verbose, explanation of various issues and solutions related to event versioning: https://leanpub.com/esversioning/read.

The text is written by Greg Young - the lead on the EventStore [1] project.

[1] https://geteventstore.com/


thank you


I encountered that problem. The ad hoc fix was to have a version field in each event and functions that translate an old event into new event(s). The code that processes the events only processes events of the current version. If your old events had been denormalized, this might result in repeated events when they are split.


To add to that, you can treat it like you would schema migrations on databases: implement v1 to v2, v2 to v3, etc., and replay the migrations in order to migrate from whatever version the event is at to the latest version. This allows keeping event migration code as immutable as the event versions it migrates between.
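
A sketch of that chaining, using invented version numbers and a generic string-map payload so it stays store-agnostic:

    object EventMigrations {
      type Payload = Map[String, String]
      case class VersionedEvent(version: Int, payload: Payload)

      // Each step is frozen once written, like the event version it upgrades from.
      val migrations: Map[Int, Payload => Payload] = Map(
        1 -> (p => p + ("user_id" -> p.getOrElse("id", ""))), // v1 -> v2: introduce user_id
        2 -> (p => p - "id")                                  // v2 -> v3: drop the old field
      )

      // Replay the steps in order, from the stored version up to the latest.
      def toLatest(e: VersionedEvent, latest: Int): VersionedEvent =
        VersionedEvent(
          latest,
          (e.version until latest).foldLeft(e.payload)((p, v) => migrations(v)(p))
        )
    }

Reading a stored `VersionedEvent(1, Map("id" -> "42"))` with `latest = 3` then yields a payload containing only `user_id`.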


Ok, thanks for the pointer


I've often wondered the same.

In data warehousing, particularly the Kimball methodology, if descriptive attributes are missing from dimensions, for example, it is common to represent them using some standard value, like "Not Applicable" or "Unknown" for string values. For integers, one might use -1 or similar. For dates it might be a specially chosen token date which means "Unknown Date" or "Missing Date".

It doesn't solve the problem of truly unknown missing information, but it at least gives a standard practice for labeling it consistently. Think of trying to do analytics on something where the value is unknown?? Not too easy, but at least it is all in one bucket.

Certainly, if past values can be derived, even if they were not created at the time the data was originally created, that is one way of "deriving" the past when it was missing. But, otherwise, I don't think there is any other way to make up for the lack of data/information.
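
A small illustration of that convention in code (the sentinel values here are just common choices, not mandated by Kimball):

    import java.time.LocalDate

    // Standard placeholders for attributes missing at load time.
    val UnknownString = "Unknown"
    val UnknownInt    = -1
    val UnknownDate   = LocalDate.of(1900, 1, 1) // token "Unknown Date"

    case class CustomerDim(name: String, age: Int, firstPurchase: LocalDate)

    // Missing source values all land in the same, clearly labeled bucket.
    def toDim(name: Option[String], age: Option[Int], first: Option[LocalDate]): CustomerDim =
      CustomerDim(
        name.getOrElse(UnknownString),
        age.getOrElse(UnknownInt),
        first.getOrElse(UnknownDate)
      )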


The best way I've seen it done is to version all of your schemas and have a database that records the transformations needed for any given schema version. That way, when you're reading a particular event, you can look up all the operations that need to be applied to that event and perform them.



What you are referring to is called a "temporal database". Your specific example is called a "bitemporal database".

https://en.wikipedia.org/wiki/Temporal_database

https://en.wikipedia.org/wiki/Bitemporal_Modeling


If the fields are changing then you effectively have DDL and migrations in your code already... so decouple them and version the schema officially. Then record these schema changes as events in the same event stream.

Build a view against these schema change events as a table of schema version by timestamp, to allow for parsing any arbitrary event.
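
A rough sketch of that view, with invented event types (the point is just looking up the schema version in force at an event's timestamp):

    import java.time.Instant

    sealed trait StreamEvent { def at: Instant }
    case class SchemaChanged(version: Int, at: Instant) extends StreamEvent
    case class DomainEvent(payload: Map[String, String], at: Instant) extends StreamEvent

    // The "view": an ordered table of (effectiveFrom, schemaVersion),
    // built from the schema-change events recorded in the same stream.
    def schemaVersions(events: Seq[StreamEvent]): Seq[(Instant, Int)] =
      events.collect { case SchemaChanged(v, at) => (at, v) }.sortBy(_._1.toEpochMilli)

    // Any arbitrary event can then be parsed with the schema in force at its timestamp.
    def versionAt(view: Seq[(Instant, Int)], at: Instant): Option[Int] =
      view.takeWhile { case (from, _) => !from.isAfter(at) }.lastOption.map(_._2)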


Recently asked on the Kafka users mailing list https://lists.apache.org/thread.html/82692004eb2292e1240c339...


Thanks for sharing, it makes me realize that our messages are not independent.


I highly recommend you look into gRPC.

Building apps using event sourcing, CQRS and microservices can easily become hell if the data models are not thought through.


I thought RPC and CQRS were diametrically opposed patterns. (Although you can use RPC in a CQRS context, but only as a transport/encapsulation layer, so the response says "request queued" or "error" but does not divulge domain-specific information such as "item created" or "item not created".)


I'm also wondering: how does one deal with changes to events when using KSQL?


Mind clarifying what you mean by changes to events?

If you create a STREAM or TABLE using KSQL, it makes sure that they are kept updated with every single event that arrives on the source Kafka topics.

That's what you'd expect in a true event-at-a-time Streaming SQL engine, which is what KSQL is.


Suppose you build a STREAM or TABLE from a topic and assume a field in the event is `id`. Later on, you introduce an update to this event where you replace `id` with `user_id`; how does KSQL react?


Don't persist the stream. The problem gets a lot easier if you stop thinking of a message bus as a data store.


How do you do a replay if you don't keep the events somewhere? I did not mean that messages were to be stored in the bus.


In your model example, you forgot to override hashCode as one should: https://stackoverflow.com/questions/2265503/why-do-i-need-to...

EDIT: nvm, I completely missed it


At work, we greatly benefited from a transition from Spring with Java to Play with Scala.

This was mostly due to the inherent complexity of Spring and the fact that being productive with Spring requires real Spring expertise. When the main developer behind the application left, we struggled to add new features or even fix bugs because the team lacked the Spring expertise.

The rest of the business mostly dealt with Scala, so it was almost a no-brainer to go with Play.

The outcome has been surprisingly positive. The application has better performance overall, is better suited for streaming, and we have much more in-house expertise to add features and fix bugs.

The rewrite was not without pain though. Spring is a well-supported and very rich framework. It probably does a bunch of things that the casual web developer would likely forget to handle.


Oh man I would love to work with Scala


I would not recommend Scala anymore unless you're a small team with smart devs eager to learn it.

My project is not so small, and the average developer is not interested in spending the time to really learn the language and functional style.

So in the end, we have code that not everyone reads and writes well, an ecosystem that is years behind Java, and tons of incompatibilities between Scala versions as it's still not stable. It's a mess, and most just end up writing Scala imperatively a la Java from 2005.

Java 8 is actually pretty good and gives you a lot of the power of Scala without the complex features few use correctly. I'm hoping Kotlin continues improving, too.


This was not my experience. Of course you need a team that has the drive to learn the language, whatever it is.

If you go out on your own and decide to build your project with Scala and a big focus on using a functional style but no one else is following, you are going to have a bad time.

I'd like to know what your issues with Scala and versions are, because we don't have those. IMHO, Scala is years ahead of Java, and it's possibly so different that comparing them does not even make sense.

The transition from Java to Scala is awkward because you can do most of the things you do in Java in Scala. With that said, I think it's counterproductive to do so. Scala is very different, and approaching it with an imperative Java (even an OO) style is the worst thing to do.

One thing that I realized is that some people can go off and write cryptic code in Scala very easily. This is true of other languages as well, but in Scala you can do a lot of things that will make you regret it the next time you try to read the code.


The last problem we had was with various Finatra / Finagle libraries written in 2.10.

We wanted to upgrade some other projects - not all of them - to 2.11.

It wouldn't work because our Thrift clients were incompatible with stubs compiled with 2.10. We didn't have the time and manpower to rewrite all the dependent services, and so we're stuck on 2.10.

This is something that'd never happen with even old-school SOAP or similar web services. It kind of left a bad taste in my mouth.


Can you speak more specifically regarding the ecosystem being behind? I'm a recent addition to the Scala development world and love it. That being said I'm eager to listen to any opposing viewpoints.


I had the exact same feeling.

Well written OP.


> YouTube's discoverability is horrible

This is pretty much my opinion as well. Fortunately I use HLTV.org to find out where the games are streamed, but yeah, this is the main problem with YouTube as it is right now.

It also lacks the Clip feature, where you can export a part of the stream as a video to share with others.

The player is great though: you can fast forward and rewind 30 seconds, and it performs amazingly even on low-performance machines/connections.


I was not aware that one could do that.


You might find inspiration in http://passwordstore.org


Maybe we could let them know so they can fix the issue (maybe you did already)?


"And yes, I've opened mutiple supports tickets over the year."


Completely irrelevant to Sigma itself, but the landing page crashes Firefox on a Galaxy S6 :(


Seriously, big thanks to Underscore for this. The Type Astronaut's Guide to Shapeless is amazing and I'd recommend it to any intermediate Scala developer who wants to step up their game.

