
Why duplicates plague CDC pipelines—and how to root them out

Anthony Accomazzo
@accomazzo
5 min read

In "no such thing as exactly once delivery", we discuss how exactly-once delivery is an asymptote: something a system can approach, but never guarantee 100% of the time. Network partitions and crashes make it impossible to ensure a downstream system saw an event precisely once.

A tempting – almost nihilistic – takeaway from that blog post is: if exactly-once delivery isn't achievable, what's the point of even trying? Make your system at-least-once and be done with it.

But that logic fails on two counts:

  1. Duplicates are costly. They may be inevitable, but they're still undesirable: they corrupt analytics and audit logs, and they trigger duplicate side effects.
  2. At-least-once systems can still minimize duplicates. With a little effort, they can track deliveries for idempotency, or send messages with idempotency keys so that consumers can deduplicate on their side.

Debezium, a change data capture tool, is squarely at-least-once. Sequin aims for "exactly-once processing": barring ephemeral issues like a sudden network partition between Sequin and its Redis instance (used for idempotency), it will deliver a message exactly once.

The rest of this post explains the engineering choices behind that difference.

How WAL‑based CDC pipelines create duplicates

Postgres' logical replication is driven by Postgres' write-ahead log (WAL). Subscribers create a replication slot on a Postgres database. They then receive an ordered stream of changes that occurred in the database: every create, update, and delete.

The replication slot is a log stream. Like a Kafka topic, the stream is offset-based. In Postgres, the offset is called a log sequence number, or LSN.

A subscriber to a Postgres replication slot:

  1. Pulls a batch of change messages.
  2. Ships the messages to a sink (e.g. Kafka, SQS).
  3. Periodically advances the LSN in Postgres.

Advancing the LSN means advancing the restart LSN for the replication slot: if the subscriber were to disconnect, the restart LSN indicates the position Postgres will start replaying messages from when it reconnects.
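The loop above can be sketched with in-memory stand-ins for the slot and sink. This is an illustrative model, not a real replication client: names like `pull_batch` and `advance_restart_lsn` are invented for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Change:
    lsn: int       # log sequence number (offset) of this change
    payload: str

class ReplicationSlot:
    """In-memory stand-in for a Postgres replication slot."""

    def __init__(self, changes):
        self.changes = sorted(changes, key=lambda c: c.lsn)
        self.restart_lsn = 0  # where Postgres replays from on reconnect

    def pull_batch(self, size):
        # A (re)connecting subscriber sees everything after restart_lsn.
        pending = [c for c in self.changes if c.lsn > self.restart_lsn]
        return pending[:size]

    def advance_restart_lsn(self, lsn):
        self.restart_lsn = lsn

def run_subscriber(slot, sink, batch_size=2):
    while True:
        batch = slot.pull_batch(batch_size)      # 1. pull a batch
        if not batch:
            break
        sink.extend(c.payload for c in batch)    # 2. ship to the sink
        slot.advance_restart_lsn(batch[-1].lsn)  # 3. advance the LSN
```

If the process dies between steps 2 and 3, `restart_lsn` is stale, and everything shipped since the last advance is replayed on reconnect – which is exactly the duplicate window discussed next.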

This matters not only for resuming from an efficient position on the next connection. It also lets Postgres discard WAL it no longer needs: if a subscriber doesn't consistently advance its restart LSN, Postgres retains old WAL files and storage fills up.

So, at any given time, a change data capture pipeline is in a partial commit state: it has pulled in many messages, some of them have been written to the sink, but the LSN/offset has not yet been advanced.

If the connector crashes while in that partial commit state, Postgres will replay every message after the restart LSN on reconnect.

Debezium follows this pattern. It relies on a restart LSN to track which messages have been processed, both in Postgres and its own internal store. When Debezium pulls a batch of changes from the WAL, it doesn't mark the LSN as processed until after it has successfully written those changes to its configured sink (like Kafka).

This means that, in practice, every time you restart Debezium or its connection to Postgres is cycled, you'll get some number of duplicate messages. For high-throughput databases, these events can easily cause tens of thousands of duplicate deliveries.

As you'll see in a bit, Sequin protects against duplicates on restarts and reconnects.

Why replays are bad

The deleterious effects of replays vary based on the use case:

Replication: Flapping

When replicating data from a database to a destination data store, replays could mean drift. Your destination data store can end up with multiple copies of a row.

Fortunately, if your source data has unique properties on it – like primary keys – you can use those to deduplicate your upserts.

However, even with upserts, replays can still leave the destination temporarily inconsistent – a problem often called "flapping".

Consider a scenario:

  1. A source row's value changes from A to B.
  2. Some time later, it changes from B to C.
  3. Debezium is restarted.
  4. When Debezium boots up, it re-sends the "A→B" update event first.

The destination row flaps back to the older state "B" for some period, until it receives the "B→C" update again.
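The scenario above can be shown in a few lines. Upserts keyed by primary key absorb exact duplicates, but a replayed *older* update still overwrites a newer value until the newer update is replayed too (a dict stands in for the destination table here):

```python
# pk -> value, standing in for the destination table
destination = {}

def upsert(pk, value):
    # Keyed upserts: a true duplicate is harmless, but an out-of-date
    # replay still clobbers a newer value.
    destination[pk] = value

# Original stream: the row goes A -> B, then B -> C.
upsert(1, "B")
upsert(1, "C")

# Debezium restarts and replays from an older restart LSN:
upsert(1, "B")                      # destination flaps back to stale "B"
value_during_replay = destination[1]
upsert(1, "C")                      # ...and recovers once B -> C replays
```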

Audit logging: Duplicates

Audit tables strive for one row == one fact. A duplicate "user promoted to admin" event muddies the timeline and confuses end users.

Side effects: Accidentally doing things twice

When triggering events off messages, replays are usually unacceptable. They might result in duplicate side effects, like sending multiple password-reset links. Or worse: double-billing a credit card.

How Sequin minimizes duplicates

Because duplicates are problematic, Sequin works hard to minimize them.

Idempotency keys

Sequin generates idempotency keys on both real-time change messages and on backfill messages.

Change messages come through the WAL. Each WAL message is associated with a transaction, and each transaction in Postgres commits at a unique LSN.

Messages within a transaction do not have their own LSN or any other unique identifier. However, the order of messages within a transaction is stable. So, Sequin annotates each message inside a transaction with a commit_idx, or the index of the message inside the transaction.

Together, commit_lsn and commit_idx let Sequin generate a unique idempotency_key for each message. The idempotency_key you'll find on each message is just the Base64-encoded string "#{commit_lsn}:#{commit_idx}"!

For backfills, Sequin needs a different strategy, as backfill rows are sourced directly from the table (via select queries) rather than from the WAL. So Sequin combines the backfill's ID with the source row's primary keys to produce the idempotency_key for a message. That yields a stable key, ensuring consumers process a given read message for a row only once per backfill.
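Both key schemes are easy to sketch. The change-message key follows the Base64 format described above; for the backfill key, the exact encoding isn't specified in this post, so joining the backfill ID and primary keys with ":" is an illustrative choice:

```python
import base64

def change_idempotency_key(commit_lsn: int, commit_idx: int) -> str:
    """Key for a WAL change message: Base64 of "<commit_lsn>:<commit_idx>"."""
    return base64.b64encode(f"{commit_lsn}:{commit_idx}".encode()).decode()

def backfill_idempotency_key(backfill_id: str, primary_keys: list) -> str:
    """Key for a backfill read: stable per (backfill, row).
    The ':'-joined encoding is an assumption for this sketch."""
    raw = ":".join([backfill_id, *map(str, primary_keys)])
    return base64.b64encode(raw.encode()).decode()
```

Because the inputs are deterministic, replaying the same change (or re-reading the same row within one backfill) always reproduces the same key.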

Sequin's idempotency filter

Sequin uses its idempotency keys to filter "at the leaf", right before delivering to the destination.

Whenever Sequin delivers a batch of messages to a sink, it writes the idempotency key for each message in that batch to a sorted set in Redis. Before delivering the next batch, it checks that sorted set and filters out any messages that were already delivered.
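The filter-at-the-leaf flow looks roughly like this, with a plain Python set standing in for the Redis sorted set (a simplification, not Sequin's actual implementation):

```python
# Stand-in for the Redis sorted set of already-delivered keys.
delivered_keys = set()

def deliver_batch(batch, sink):
    # 1. Filter: drop anything whose idempotency key was already delivered.
    fresh = [m for m in batch if m["idempotency_key"] not in delivered_keys]
    # 2. Deliver the remainder to the sink.
    sink.extend(m["payload"] for m in fresh)
    # 3. Mark as delivered. A crash between steps 2 and 3 is the rare
    #    window in which a batch can be redelivered.
    delivered_keys.update(m["idempotency_key"] for m in fresh)
```

Delivering the same batch twice produces no new sink writes, because every key is filtered on the second pass.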

That means the only situation where Sequin will replay a message is one where:

  1. Sequin was able to conduct a read operation against the Redis sorted set to do idempotency filtering.
  2. Sequin delivers a batch of messages to the sink.
  3. After delivery, Sequin is unable to reach Redis to mark the messages as delivered.

So this should happen rarely. And when it does, it affects only the messages on the knife's edge between Redis being available and unavailable.

Of course, when something can happen – even in very rare circumstances – a team ought to account for it. We plan to add an alert for these situations, so that teams can intervene manually when a batch of messages may have been redelivered.

What's more, Sequin includes the idempotency key as a field, metadata.idempotency_key, on all messages. Some destinations let Sequin use this key to de-dupe delivery. For example, with SQS and NATS, you can set a de-duplication key on write – making Sequin's delivery model exactly-once. Otherwise, the field is available to consumers as a final layer of idempotency.
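As one concrete illustration: SQS FIFO queues accept a MessageDeduplicationId on send (a real boto3 send_message parameter, honored within SQS's five-minute deduplication window). A sketch of mapping metadata.idempotency_key onto it – the queue URL and MessageGroupId below are hypothetical, and no AWS call is made; we only build the call's kwargs:

```python
import json

def sqs_send_kwargs(queue_url: str, message: dict) -> dict:
    """Build kwargs for boto3's sqs_client.send_message(**kwargs).
    Works only on .fifo queues, which support deduplication IDs."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(message),
        # SQS FIFO drops messages whose dedup ID was already seen
        # within its 5-minute deduplication window.
        "MessageDeduplicationId": message["metadata"]["idempotency_key"],
        "MessageGroupId": "cdc",  # hypothetical grouping choice
    }
```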

Conclusion

The difference between Debezium and Sequin's approach to change data capture comes down to how they handle the inevitable reality of duplicates. While Debezium accepts duplicates as a natural consequence of at-least-once delivery, Sequin actively works to minimize them through idempotency tracking and filtering.

For use cases where duplicates are particularly costly (audit logs, financial transactions, or systems that trigger expensive side effects), Sequin's approach of tracking message delivery at the individual message level can provide significant value.

While exactly-once delivery may be impossible to guarantee 100% of the time, systems can still be designed to make duplicates extremely rare rather than routine.

You can read more about Sequin's consistency model in the docs.