Debezium is a popular solution for streaming changes from Postgres. It was among the first generation of tools purpose-built to help developers implement a change data capture (CDC) pattern. While Debezium introduces complex dependencies like Kafka, it abstracts away many of the challenges of detecting and delivering every change in your database.
This post aims to help developers understand how Debezium works with Postgres so you can better architect your system.
What is Debezium?
Debezium captures row-level changes in databases and streams them to Kafka topics. It's implemented as a set of Kafka Connect source connectors, with each connector ingesting changes from a different database system using that database's native logging capabilities.
Unlike polling or dual-write approaches, Debezium's log-based CDC implementation:
- Guarantees that all data changes are captured without gaps, including every insert, update, and delete
- Produces change events with minimal delay
- Requires no modifications to your data models (no "Last Updated" columns)
- Preserves original record states and transaction metadata
Debezium transforms each row-level operation into a change event record and streams it to a Kafka topic. Applications can consume these event streams to react to data changes in near real-time, enabling use cases from cache invalidation to analytics pipelines.
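To make this concrete, here is a minimal sketch of the configuration you'd register with Kafka Connect to start a Postgres connector (the connector name, host, credentials, and table list are placeholders, and the option names shown are from recent Debezium releases):

```json
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "app",
    "topic.prefix": "dbserver1",
    "table.include.list": "public.orders"
  }
}
```

POSTing this to the Kafka Connect REST API starts the connector.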
How Does Postgres Emit Logs to Debezium?
When integrating with Postgres, Debezium interfaces with Postgres’ write-ahead log (WAL). The WAL is essentially a sequential log of all database modifications. Importantly, any change in the Postgres database is first added to the log and flushed to persistent storage (hence write-ahead) before the change is actually committed.
Consider what happens when you make an `INSERT` in Postgres:
- Open transaction: Postgres validates the insert statement, permissions, and creates an execution plan. Ultimately, a transaction is opened for the insert.
- Create WAL record: Before modifying any data in memory, Postgres creates a WAL record that contains the details of the operation. The WAL record includes a Log Sequence Number (LSN) that uniquely identifies the position / order of the operation.
- In-memory buffer: The WAL record is written to an in-memory WAL buffer just prior to the actual row being inserted into the in-memory buffer for the table.
- Flush to disk: When the transaction commits, the WAL buffer is flushed to disk and the transaction is marked as committed. At this point, your client receives confirmation that the insert completed. This is also when a client subscribed to the WAL (e.g. Debezium) receives the WAL entry for the insert (see the sketch after this list).
- Checkpointing: At some later point, the change made in memory is also written to the table's data files on disk.
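You can observe the WAL position advancing through this lifecycle with a couple of built-in functions (a minimal sketch; `users` is a hypothetical table):

```sql
-- Note the current WAL insert position (an LSN such as 0/1A2B3C4D)
SELECT pg_current_wal_insert_lsn();

BEGIN;
-- The WAL record is created before the row lands in shared buffers
INSERT INTO users (name) VALUES ('Ada');
COMMIT;  -- the WAL buffer is flushed to disk here

-- The position has advanced past this transaction's WAL records
SELECT pg_current_wal_insert_lsn();
```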
Debezium connects to Postgres via a replication slot to read the records in the WAL (at step four above). The WAL records in the slot are decoded from binary using the `pgoutput` plugin, a setup sketched after the list below. This approach to detecting changes comes with important benefits:
- Exact order: By using the LSN as a cursor, Debezium is able to process changes in order.
- Only completed transactions: Changes are only flushed to the WAL when the transaction commits, so Debezium never captures partial or rolled-back transactions.
- Minimal database overhead: Changes are committed to the WAL as a normal part of the transaction. As long as Debezium pulls and flushes messages from the replication slot (foreshadowing!), no new overhead is added to the database.
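In SQL terms, that setup looks roughly like the following (a sketch: Debezium can create the publication and slot itself, and the names used here are hypothetical):

```sql
-- A publication tells pgoutput which tables' changes to emit
CREATE PUBLICATION dbz_publication FOR ALL TABLES;

-- A logical replication slot holds the connector's place in the WAL
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

-- Peek at pending changes without consuming them from the slot
SELECT * FROM pg_logical_slot_peek_binary_changes(
  'debezium_slot', NULL, NULL,
  'proto_version', '1',
  'publication_names', 'dbz_publication'
);
```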
The Debezium Capture Process
When you first connect Debezium to a Postgres table, it follows a well-defined process to capture and stream changes to guarantee consistency:
Initial snapshot (a.k.a. backfill)
When starting a new stream on a Postgres table, you can configure Debezium to perform a snapshot of the table. This will add the current state of every row in the table to your Kafka topic as `read` events.
To complete the snapshot before streaming new changes (all while ensuring no change is missed in the interim), Debezium does the following:
- Starts a SELECT transaction
- Notes and stores the current LSN in the WAL
- Scans the configured tables and generates `read` events for each row
- Commits the SELECT transaction and records the snapshot completion
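Approximated in SQL, the snapshot phase behaves like this (a simplified sketch; `users` is a hypothetical table, and Debezium manages these steps internally):

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;  -- a consistent view of the data

-- Record where streaming should resume once the snapshot completes
SELECT pg_current_wal_lsn();

-- Every row returned here becomes a `read` event in Kafka
SELECT * FROM users;

COMMIT;
```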
Continuous streaming
After completing the snapshot, Debezium transitions to streaming changes from the exact WAL position (using the stored LSN) where the snapshot occurred.
As described above, Debezium leverages built-in logical replication to capture these changes from the database with strict ordering.
Event transformation
Debezium transforms PostgreSQL's logical decoding events into Debezium's standardized change event format, which includes:
- A consistent structure for inserts, updates, and deletes
- The table's primary key as the event key
- Both before and after states of the modified row
- Metadata about the source transaction and timing
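As a simplified sketch, the value of an update event for a hypothetical `users` table looks something like this (field values are illustrative, and the real envelope also carries a schema section):

```json
{
  "before": { "id": 42, "name": "Ada" },
  "after": { "id": 42, "name": "Ada Lovelace" },
  "source": {
    "connector": "postgresql",
    "db": "app",
    "schema": "public",
    "table": "users",
    "lsn": 987654321,
    "txId": 56789
  },
  "op": "u",
  "ts_ms": 1712345678901
}
```

The Kafka record's key would carry the row's primary key (here, `id`).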
You can also use Single Message Transforms (SMTs, a feature of Kafka Connect) to modify message payloads as they flow through the pipeline.
Kafka topic routing
Each table's changes are directed to a dedicated Kafka topic, typically named in the pattern `<server-name>.<schema-name>.<table-name>` (e.g. `dbserver1.public.orders`). Once delivered to Kafka, the change is in a durable stream that can be processed by consumers.
Offset tracking
Once a change has been successfully delivered to Kafka, Debezium periodically updates Postgres with the `confirmed_flush_lsn` associated with the change. Postgres can then reclaim the WAL that the replication slot no longer needs.
Importantly, if Debezium fails to deliver messages to Kafka, it won't notify Postgres that those messages can be flushed. WAL then accumulates behind the replication slot, which can consume enough disk to crash the database.
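You can watch this position, and how much WAL a slot is holding back, directly in Postgres (the slot name here is hypothetical):

```sql
-- How much WAL is retained because the slot hasn't confirmed it yet?
SELECT
  slot_name,
  confirmed_flush_lsn,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
  ) AS retained_wal
FROM pg_replication_slots
WHERE slot_name = 'debezium_slot';
```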
Debezium Postgres Benefits
Complete data capture
Critically, Debezium's approach to CDC guarantees that every committed insert, update, and delete will be captured from your database. By using WAL records with LSNs as a cursor, it provides strict ordering guarantees and can enrich messages with transaction metadata.
Non-intrusive
Debezium requires no schema modifications (e.g. no `last_updated` column) and leverages the WAL (a normal part of every database transaction) to minimize database overhead.
Consistency
With both snapshots and streaming, Debezium is able to deliver every row in a table to a Kafka topic.
Debezium Postgres Limitations
Complex dependencies
Debezium introduces complex dependencies like Kafka to achieve even simple CDC use cases. This is a significant operational burden.
Replication slot risk
Debezium does not include a mechanism (e.g. a dead letter queue) for handling bad messages that fail to deliver to Kafka. This means one bad message can lock up Debezium, cause the replication slot to back up, and potentially consume significant resources on the database.
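One guardrail on Postgres 13+ is to cap how much WAL a slot may retain, at the cost of invalidating the slot (and forcing a new snapshot) if the cap is hit:

```sql
-- Cap the WAL retained by replication slots (Postgres 13+);
-- a slot that falls further behind than this is invalidated
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();
```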
Throughput constraints
Tuning Debezium to match the throughput of your Postgres database is not easy. You'll need to be comfortable creating multiple replication slots, tuning Debezium beyond its default config, and getting very familiar with Kafka. Debezium can also struggle with common Postgres data types like JSONB, and with TOASTed (oversized) column values, which Postgres omits from the WAL on updates when they are unchanged.
Conclusion
Debezium has set the standard for Postgres change data capture. By leveraging log-based replication and the tooling of Kafka Connect, it’s a complete CDC platform.
However, because it isn't tuned specifically to Postgres and requires Kafka as a dependency, Debezium isn't easy to use or maintain. A single bad message can jam your pipeline, so you'll spend significant time tuning and monitoring the system (in the JVM!) to keep things running smoothly.
At Sequin, we're improving on Debezium's approach - tuning it specifically for Postgres and removing the Kafka dependency. The result is a 6.8x improvement in speed with higher availability and resiliency - all with nicer tooling.