No such thing as exactly-once delivery

Anthony Accomazzo

@accomazzo

•

Sep 29, 2024

•

4 min read

We're Sequin, a Postgres CDC tool to streams and queues like Kafka, SQS, HTTP endpoints, and more. We recently had a debate over on a GitHub issue about delivery semantics. Below, we expand on our take.

We say Sequin is a system with "at-least-once delivery" and "exactly-once processing" guarantees.

What's the difference between delivery and processing? And why don't we consider Sequin an "exactly-once delivery" system?

At-most-once vs at-least-once delivery

At-most-once delivery is what many pub/sub systems offer. Messages in at-most-once delivery systems are ephemeral. The publisher blasts out a message. If a subscriber isn't subscribed or there's a network issue in delivery, the subscriber will miss the message.

Postgres' LISTEN/NOTIFY is an example of an at-most-once delivery system.

In at-least-once delivery systems, the system guarantees delivery. To do this, it persists the message and its delivery state. It won't consider a message delivered until it receives confirmation that the receiver received it:
We say "at-least-once" because there's always a chance a message will be redelivered. A receiver might receive a message, but before the receipt is confirmed by both the database and the receiver, there's an interruption:

This is the two-phase commit or distributed transaction problem. The sender needs to commit the fact that the message was delivered to two places: to the receiver and the database.

If it tells the receiver "cool, confirmed you received the message" before the database commit, there's a risk the database commit will fail and the system will erroneously assume the message has not been delivered yet.

But if it instead commits delivered=true to the database first, there's a risk its connection to the receiver will be interrupted before it can send the confirmation. If that happens, it needs to revert delivered back to false – but what if that commit fails?

Exactly-once delivery

This is why exactly-once delivery is a platonic ideal you can only asymptotically approach. You can't transactionally flip two bits in two different physical locations with a 100% guarantee.

You can get pretty dang close with some work. For example, right before you commit to both places, you can ready both systems. If you have an open transaction with your database and recently verified your TCP connection to your receiver is still good, you've really improved your odds. But perfection is not possible.

That's why we don't say Sequin offers exactly-once delivery anywhere. We say it offers at-least-once delivery and exactly-once processing.

Processing is the the full message lifecycle: the message was delivered to the receiver, the receiver did its job, and then the receiver acknowledged the message.

With that definition, SQS, Kafka, and Sequin are all systems that guarantee exactly-once processing. The term processing captures both the delivery of the message and the successful acknowledgment of the message.

Of course, acknowledgements do not make the two-phase commit problem go away. For example, imagine when a worker processes a message, it performs a side effect like sending an email. It very well could receive the message, send the email, but then due to a network error fail to acknowledge the message. In these systems, that message would be reprocessed, resulting in two sent emails!

In my mind, the terms at-most-once and at-least-once delivery help us distinguish between delivery mechanics. And the term "exactly-once processing" indicates it's a messaging system with at-least-once delivery and acknowledgments.

No matter what, I hope this helps you appreciate the fact that a messaging system can only get you so far. As the client, you need to design your system to handle these edge cases gracefully.

Or as Smokey the Message Bearer might say:

🫵 Only you can prevent reprocessing issues

Given the two-phase commit problem, there are a few things you can do to mitigate reprocessing issues.

Most important, account for the fact that you can't get to 100% in your design. You have 3 overall options in your system design:

1. Accept the consequences (i.e. bugs) resulting from redeliveries
2. Make your system idempotent so that redeliveries are OK
3. Choose at-most-once delivery and accept some messages will never be delivered

Which route you take really depends! For example, if you're a bank, you can't miss a transaction, so you'll make your system idempotent. If you're processing analytics and decide you'd prefer to miss an event than count it twice, you'll err towards at-most-once.

Second, break up your messages into units of work as needed. If one message will cause 10 side effects, be mindful the worker can crash at any point during that process and leave your system in an inconsistent state. This doesn't mean that every message should correspond to a single operation or side effect. But it does mean that you should consider things like how to make multi-step workflows eg idempotent.

And last, configure your timeouts correctly. In SQS and Sequin, you can configure a consumer's visibility timeout. This specifies how long after a message has been delivered the system waits until it considers the message eligible for redelivery. The idea is that if the system doesn't hear back from a worker, it will assume it crashed, and so redeliver the message.

You want to set your visibility timeout to a conservative number that gives your worker time to finish. In addition, if your runtime allows it, you want to set a hard timeout on your worker's processing time that's below the visibility timeout. That way, when the visibility timeout expires, you know ~for sure that the worker is down.