State drift

I generally haven't written much about software architecture. People make heuristics into religion. But here is something I thought about: how to build self-correction into systems. This has been vaguely sitting in my head without a clear conceptual definition, until a whole slew of things popped up today that all had the exact same issue at their core. I'm going to refer to it as state drift, for lack of a better term.

What is state drift?


State drift is what happens when two components synchronize state and their views slowly diverge. Actually, it doesn't even have to be two, and the “state” may be somewhat virtual rather than explicit. But to make it easy, let's say we have a producer that maintains a state and sends deltas to an observer, which updates its own copy of the state. If care isn't taken, the state in the observer will start to drift away from what it's supposed to be. Without a self-correction mechanism, you are forever screwed.

These three real world cases came up independently of each other today:

  • Service updates over email: sadly some of our vendors have suboptimal APIs (this is the mortgage industry) and we have to resort to email parsing to get certain updates. But emails get lost and formats change, and suddenly you realize your view of the outstanding service order doesn't match the vendor's view.
  • Webhooks: they have a lot of other annoying issues (more on that later) but in particular there are rarely any redelivery guarantees. A lot of APIs call back using webhooks when some data is ready (for operations that take longer than a second), but what if that callback gets lost in the ether?
  • Websockets to push state to clients – we use a single page app and some of the state is pushed from the backend to the frontend, meaning the frontend has its own view of what the backend state looks like.

Why can't you just write code with 100% uptime? :trollface:

Ok great thanks! I should have thought about it.

But yeah, the problem is that if you lose a single delta and there's no way to self-correct, it's game over. Sadly, it doesn't matter if you have 99.99% uptime. Conversely, a system with a level of self-correction built in can afford quite a lot of errors and still recover very nicely. In a lot of scenarios, you don't even have a choice. TCP connections randomly drop, and messages will get lost. Ideally you want both solid redelivery guarantees and some way to recover from corrupted state.

How to solve it?

Just briefly wanted to mention that I'm not a super big fan of webhooks, and I will elaborate a bit later.

I think the best solution when you have webhooks (or any delta pushing mechanism) is to treat the updates opportunistically, and use a separate mechanism to reconcile state every once in a while. This can often be done in the simplest way possible: every once in a while, copy the entire state from the producer to the observer.
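To make that concrete, here is a minimal sketch. The `Producer` and `Observer` classes and the order names are made up for illustration; the point is just that deltas are applied best effort and a full copy fixes whatever got lost.

```python
import copy


class Producer:
    """Holds the authoritative state and emits deltas (hypothetical)."""

    def __init__(self):
        self.state = {}

    def set(self, key, value):
        self.state[key] = value
        return {"key": key, "value": value}  # the delta to push

    def snapshot(self):
        return copy.deepcopy(self.state)


class Observer:
    """Keeps a local copy that is updated from deltas, best effort."""

    def __init__(self):
        self.state = {}

    def apply_delta(self, delta):
        self.state[delta["key"]] = delta["value"]

    def reconcile(self, snapshot):
        # Throw away whatever we believed and take the producer's word for it.
        self.state = dict(snapshot)


producer, observer = Producer(), Observer()

observer.apply_delta(producer.set("order-1", "requested"))
producer.set("order-1", "completed")  # this delta is lost in transit
observer.apply_delta(producer.set("order-2", "requested"))

drifted = observer.state != producer.state  # order-1 is stale in the observer

observer.reconcile(producer.snapshot())  # the periodic full copy
in_sync = observer.state == producer.state
```

In a real system `reconcile` would run on a timer or on reconnect, and how often you run it decides how long the observer can stay wrong.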

Of course, you can make this more advanced: afaik Git uses Merkle trees to synchronize state. Another example is rsync, which has an algorithm that compares two directory listings before it copies any files. These are state synchronization mechanisms that are designed to detect and fix any discrepancies.
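A much simpler cousin of those ideas, sketched below, is to exchange one hash per key and only copy the entries that differ. To be clear, this is neither Git's nor rsync's actual algorithm, just a flat digest comparison, and all function names here are hypothetical.

```python
import hashlib
import json


def digest(value):
    # Stable hash of a JSON-serializable value.
    blob = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def summarize(state):
    # One small digest per key; this is what gets sent over the wire.
    return {key: digest(value) for key, value in state.items()}


def diff(producer_summary, observer_summary):
    # Keys the observer is missing or has a different value for.
    stale = [k for k, h in producer_summary.items() if observer_summary.get(k) != h]
    # Keys the observer has that the producer no longer knows about.
    extra = [k for k in observer_summary if k not in producer_summary]
    return sorted(stale), sorted(extra)


def repair(observer_state, producer_state, stale, extra):
    for key in stale:
        observer_state[key] = producer_state[key]
    for key in extra:
        del observer_state[key]


producer_state = {"a": 1, "b": {"status": "done"}, "c": 3}
observer_state = {"a": 1, "b": {"status": "pending"}, "d": 4}

stale, extra = diff(summarize(producer_state), summarize(observer_state))
repair(observer_state, producer_state, stale, extra)
```

Only the values for `stale` keys have to be transferred, which is the whole reason to bother with digests instead of copying everything.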

Luigi (mentioning it mostly because I'm the author) has a particularly simple reconciliation mechanism: the file system. Every time a workflow starts, it checks what's been done and what's not been done by seeing what exists (although this can be any user-provided checkpoint, in practice it's local files, HDFS, S3 or something similar). It pushes everything to the server during scheduling. During the work phase, it pushes deltas to the server, and the server tells the worker what to work on next. Occasionally either of those things fails, but we can ignore that without much concern. The fact that a full state synchronization happens every time a workflow is scheduled means the state drift is limited.
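The idea of using "does the output exist?" as the source of truth can be sketched in a few lines of plain Python. This is not Luigi's API, only an illustration of the principle, and the task names and file paths are made up.

```python
import os
import tempfile


def pending_tasks(outputs):
    # A task counts as done if and only if its output file exists.
    return [name for name, path in outputs.items() if not os.path.exists(path)]


workdir = tempfile.mkdtemp()
outputs = {
    "extract": os.path.join(workdir, "extract.csv"),
    "transform": os.path.join(workdir, "transform.csv"),
    "load": os.path.join(workdir, "load.done"),
}

# Pretend an earlier run finished "extract" and then crashed.
with open(outputs["extract"], "w") as f:
    f.write("id,value\n")

# The todo list is rebuilt from disk, not from remembered deltas.
todo = pending_tasks(outputs)
```

Because the list is recomputed from what actually exists, it doesn't matter which status updates were lost during the previous run.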

Now that I think about it, the same principle applies to video compression. If there were no key frames, the state drift would increase slowly over time. Key frames bring the state drift back to 0.

Full state reconciliation might be way too expensive, so another complementary strategy is to make sure there are very strong delivery guarantees for the deltas. Systems like Kafka use a distributed durable commit log to store a message queue. If a consumer dies, it can re-consume all messages from a certain point in time. It would be awesome if websockets could work the same way, so that if the TCP connection is lost, we know where to resume from.
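Here is a toy, in-memory version of the resume-from-offset idea. It is not Kafka's client API; `DeltaLog` and `Consumer` are hypothetical names, and a real system would persist both the log and the consumer's offset.

```python
class DeltaLog:
    """Append-only log; every entry gets a monotonically increasing offset."""

    def __init__(self):
        self._entries = []

    def append(self, delta):
        self._entries.append(delta)
        return len(self._entries) - 1  # offset of the new entry

    def read_from(self, offset):
        # Return (offset, delta) pairs starting at `offset`.
        return list(enumerate(self._entries[offset:], start=offset))


class Consumer:
    def __init__(self, log):
        self.log = log
        self.next_offset = 0  # assumption: stored durably in a real system
        self.state = {}

    def poll(self):
        for offset, (key, value) in self.log.read_from(self.next_offset):
            self.state[key] = value
            self.next_offset = offset + 1


log = DeltaLog()
consumer = Consumer(log)

log.append(("x", 1))
consumer.poll()

# The consumer is "down" while these two deltas are produced.
log.append(("x", 2))
log.append(("y", 5))

consumer.poll()  # resumes at offset 1, so nothing is lost
```

The key design choice is that the consumer, not the producer, tracks how far it has gotten. The producer never needs to know whether anyone was listening at the time.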

Speaking of webhooks…

Not a super big fan. Here are some reasons:

  • I have a simple script to call some remote server but suddenly I need to run an HTTP server inside it
  • … and I also need to poke holes through a bunch of firewalls
  • A huge problem is the lack of redelivery guarantees. If you don't have 100% uptime, which is basically impossible, then you will lose data. If that data is not possible to recover, you have a problem.
  • In practice often the URL of the webhooks is hardcoded somewhere, meaning you can't decouple the caller/callee. Ideally you want multiple clients to be able to call and receive updates from the same API. The easiest solution to this is to make the webhook URL part of the request payload.
  • Debugging issues is much harder because you don't know where something got lost. If something is always pulling at one end, you know immediately when it breaks. With push, that doesn't happen.

… but webhooks are here to stay, so let's accept it. Here's a free open source project idea (I'm actually somewhat tempted to build this): provide a simple service that lets you expose a webhook to the world, but stores all incoming requests in a durable queue. Support long polling (think tail -f) for updates as well. I think it would be almost trivial to build something like this, using a Lambda worker in AWS together with SQS, but I have little experience with those so I'll defer it for now. But feel free to steal :)
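For what it's worth, the core of that idea fits in a few lines. The sketch below swaps in SQLite as the durable store instead of Lambda and SQS, leaves out the HTTP layer and the long polling entirely, and uses a hypothetical `WebhookInbox` class. It only shows the "store everything, let clients pull from a cursor" part.

```python
import sqlite3


class WebhookInbox:
    """Stores every incoming webhook body; clients pull from a cursor."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to survive restarts.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS requests ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )
        self.db.commit()

    def receive(self, body):
        # Called by the (not shown) HTTP handler for each webhook request.
        cur = self.db.execute("INSERT INTO requests (body) VALUES (?)", (body,))
        self.db.commit()
        return cur.lastrowid

    def poll(self, after_id=0, limit=100):
        # Return (id, body) rows newer than the client's cursor, oldest first.
        rows = self.db.execute(
            "SELECT id, body FROM requests WHERE id > ? ORDER BY id LIMIT ?",
            (after_id, limit),
        )
        return rows.fetchall()


inbox = WebhookInbox()
inbox.receive('{"event": "order.completed", "order": 1}')
inbox.receive('{"event": "order.completed", "order": 2}')

first_batch = inbox.poll()
cursor = first_batch[-1][0]  # the client remembers the last id it saw

inbox.receive('{"event": "order.completed", "order": 3}')
second_batch = inbox.poll(after_id=cursor)
```

A client that was offline for an hour just calls `poll` with its last cursor and catches up, which is exactly the redelivery guarantee plain webhooks lack.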
