We're Sequin, an open source message stream like Kafka, but built on Postgres. Building event-driven workflows is a common use case for Sequin, which means writing events to an event table with a TTL. This led us to explore strategies for time-based retention in Postgres.
Whether you're managing event logs, audit trails, or time-series data, there comes a point when you need to start pruning old records. Can Postgres do it? And what tools are at your disposal?
In this post, we'll explore three approaches to implementing retention policies in Postgres: using pg_cron for scheduled deletions, using pg_partman for partition-based retention, and rolling your own partition-based solution.
Using pg_cron
pg_cron is a Postgres extension for running cron jobs. It's a simple way to run scheduled tasks. You can use it to delete old data from your tables on a regular interval.
Basic setup
First, enable the extension:
create extension pg_cron;
Then, you create a cron job to delete old data like this:
select cron.schedule('delete-old-events', '0 0 * * *', $$
delete from events
where inserted_at < now() - interval '30 days'
$$);
This will run the delete statement on a daily basis at midnight.
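For the delete to stay fast as the table grows, the where clause needs an index on inserted_at. If you don't already have one, something like this works (a sketch; on a large, live table you'd likely want to create it concurrently):

create index if not exists events_inserted_at_idx on events (inserted_at);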
If your table has higher throughput, you'll want to run delete statements more frequently. You'll also want to batch your deletes to mitigate locking issues:
Advanced setup
To loop your deletes, create a stored procedure. This lets you commit after each delete statement, reducing how long each delete holds locks:
create or replace procedure cleanup_old_events()
language plpgsql as $$
declare
  batch_size integer := 10000;
  deleted_count integer;
begin
  loop
    delete from events
    where id in (
      select id
      from events
      where inserted_at < now() - interval '30 days'
      limit batch_size
      for update skip locked
    );

    get diagnostics deleted_count = row_count;
    commit;

    exit when deleted_count < batch_size;
  end loop;
end;
$$;
The procedure exits when the number of deleted rows is less than the batch size, indicating it has worked through all of the old data. get diagnostics ... row_count is how PL/pgSQL exposes the number of rows affected by the last statement (the FOUND variable, by contrast, is only a boolean).
Then, you can run the procedure in a cron job like this:
select cron.schedule('delete-old-events', '*/10 * * * *', 'call cleanup_old_events()');
This will run the procedure every 10 minutes.
Monitoring
pg_cron logs to the cron.job_run_details table. You can query it for information about recent jobs, including their status, start_time, end_time, and return_message.
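For example, a query like this pulls the most recent runs so you can spot failures or slow deletes (a sketch against pg_cron's standard columns):

select jobid, status, return_message, start_time, end_time
from cron.job_run_details
order by start_time desc
limit 20;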
Can it scale?
A common concern with pg_cron is that it will not be able to keep up with write volume or that it will impact performance.
Deletes need to visit every affected row and mark it for deletion, creating "dead tuples." This is work for the autovacuum process to do later to clean up the disk. So there's immediate pressure (scanning/marking) and delayed pressure (vacuum cleanup).
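You can watch that pressure in pg_stat_user_tables, which tracks dead tuples and the last autovacuum run per table. A sketch:

select relname, n_live_tup, n_dead_tup, last_autovacuum
from pg_stat_user_tables
where relname = 'events';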
That said, pg_cron can get you surprisingly far, especially on Postgres 17, which has greatly improved vacuum performance.
I've done some basic benchmarking, simulating a system that experiences 100M writes/day (≈1K writes/sec) on Postgres 17. On a 16 CPU/64GB machine, I was able to run pg_cron with batch deletes every 5 minutes without any issue. Delete queries took under 35ms to run, and writes were not impacted. Vacuum seemed able to keep up easily. My simulation was far from perfect and had very little read volume, but it did suggest that a basic pg_cron retention strategy may be sufficient for higher-volume systems.
Using pg_partman
Compared to running DELETE statements, using partitions to enforce a retention policy is much more efficient. If you have a table sequin_events, you can partition it by day like so:
sequin_events
├── sequin_events_p20241025
├── sequin_events_p20241024
├── sequin_events_p20241023
├── sequin_events_p20241022
└── sequin_events_p20241021
Then, to drop a day's worth of data, you need only drop that partition. As a metadata operation, this can remove millions of rows by just updating the table's partition definition and deallocating the underlying files. It's more like changing a pointer in the filesystem than touching the actual data.
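Concretely, clearing the oldest day from the tree above is a single statement. If you'd rather archive the data than delete it, you can detach the partition instead and keep it around as a standalone table:

-- delete the oldest day's data in one metadata operation
drop table sequin_events_p20241021;

-- or detach it first to keep the data as a standalone table
alter table sequin_events detach partition sequin_events_p20241021;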
pg_partman is a Postgres extension for creating and managing partitions. It comes with a background worker that will create new partitions and drop (or archive) old partitions for you.
In my opinion, it's a bit complicated to set up. There are a number of configuration options, and lots of new concepts. And there's a lot to be mindful of when it comes to maintenance and operation.
However, it's a nifty tool. What follows is a glide path for using pg_partman to enforce a time-based retention policy:
Setting up pg_partman
First, you'll need to make sure the table you want to create a retention policy on is partitioned by range:
create table public.sequin_events (
  id bigserial,
  inserted_at timestamp with time zone not null default now(),
  -- ...
  primary key (id, inserted_at)
) partition by range (inserted_at);
After installing and setting up pg_partman, you can initialize it in your database:
create schema partman;
create extension pg_partman schema partman;
There are four table types in a pg_partman setup:
- The parent table is the primary table you're going to partition. It's the table you run your insert and select statements against. Your queries will then be routed to the appropriate partition table. (This is standard Postgres stuff.)
- Partition tables are standard Postgres partitions. If you have a 30-day retention policy, there will be ~30 partition tables, one for each day.
- The template table is a table based on your parent table that pg_partman uses to generate the schema for partition tables. Template tables are a way to allow some drift in the schema between your parent table and its partition tables. Most likely you don't need any drift, so this will just point to your parent table.
- The default table is a partition table that rows are routed to if they don't match any of the existing partitions. We'll talk about this more in a moment.
So, to set up pg_partman on your parent table, run this command:
select partman.create_parent(
  p_parent_table := 'public.sequin_events',
  p_control := 'inserted_at',
  p_interval := '1 day',
  p_automatic_maintenance := 'on',
  p_default_table := false,
  p_template_table := 'public.sequin_events'
);
Here are the options you provide:
- p_control: This is the column to partition by.
- p_interval: This is a Postgres interval value specifying the intervals for your partitions.
- p_automatic_maintenance: Uses the pg_partman background worker to perform maintenance (i.e. create and drop partitions). (Be sure to follow the README instructions to set this up.)
- p_default_table: This turns the default table off.
- p_template_table: We specify that the parent table is the template for partitions, i.e. there is no difference in the schema between the parent and its children.
This function call:
- Creates a row in partman.part_config. This is the table the maintenance worker references when it wakes up to figure out what partitions it needs to manage and how to manage them.
- Creates the default table (only if you've enabled it; we turned it off with p_default_table := false).
- Runs a maintenance loop, which will create your initial set of partitions.
You'll notice the partition tables in your database now:
sequin_events
├── sequin_events_p20241025
├── sequin_events_p20241024
├── sequin_events_p20241023
├── sequin_events_p20241022
└── sequin_events_p20241021
Note that the create call does not specify a retention policy. Strangely, create_parent does not accept all possible configuration options. So, next, you need to update the configuration you just created with your retention policy:
update partman.part_config
set
  retention = '1 day',
  retention_keep_table = false,
  infinite_time_partitions = true
where parent_table = 'public.sequin_events';
If you'd prefer to have pg_partman keep old partitions but just detach them from the primary table, that's possible. You can leave retention_keep_table set to true, the default.
In my testing, it seemed necessary to set infinite_time_partitions to true in order to get pg_partman to correctly create future partitions (the default is false). It's admittedly not clear to me why. Every other guide and tutorial I've seen also sets it to true with no explanation.
With that, your table is set up for time-based retention.
For testing purposes, you can use much lower p_interval and retention values to see data get cycled faster. You'll just want to be sure to set the pg_partman_bgw.interval value in your postgresql.conf to a lower value too, so the background worker runs more frequently. See the README.
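For reference, the background worker is configured in postgresql.conf. The relevant settings look roughly like this (a sketch; the role and database values are assumptions, and the pg_partman README covers the details):

shared_preload_libraries = 'pg_partman_bgw'
pg_partman_bgw.interval = 60          # seconds between maintenance runs; lower it for testing
pg_partman_bgw.dbname = 'my_database'
pg_partman_bgw.role = 'postgres'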
The default table
Previously, I recommended configuring your setup without a default table. As you'll see, there isn't really a "right" answer here:
The default table (or default partition) is where rows are routed if they don't fit the criteria for the other partitions.
Let's say you have a retention policy of 5 days, so you have 10 partition tables: 5 for the previous 5 days and 5 for the next 5 days. Maybe there was a bug in your system, and now you're trying to insert a row from 6 days ago. But because there is no partition for that row to route to, you'll get an error:
ERROR: no partition of relation "sequin_events" found for row
DETAIL: Partition key of the failing row contains (inserted_at) = (2024-10-27 07:24:13.524879).
If you had a default table, the insert would have succeeded, as the row would have been routed to the default partition.
However, this escape hatch comes with tradeoffs. The tradeoff depends on whether the row is "too old" and falls before the retention window, or "too far in the future" and falls ahead of the buckets that pg_partman "premakes" in advance.
Rows that fall outside the retention window because they're too old will be inserted into the default table and retained indefinitely. That's because pg_partman does not apply a retention policy to the default table.
So, if you choose to have a default table, you'll want to set up some monitoring and alerting so you can manually intervene and deal with those rows.
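A simple check is to alert whenever the default table is non-empty. pg_partman typically names it after the parent table, so the query is something like this (a sketch; confirm the default table's actual name in your setup):

select count(*) as stranded_rows
from sequin_events_default;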
Alternatively, without a default table, you could just discard those rows as they come in. But, of course, these rows are likely symptomatic of a broader issue in your system that should be addressed.
For rows that fall outside the retention window because they're too far in the future, things are even more complicated. If they're inserted into the default table, then when pg_partman goes to make the partition for the day corresponding to that future row, it will fail: the default table already contains a row that belongs in the new partition, and Postgres can't create the partition and rebalance rows out of the default into it in one go.
So, rows inserted into the default table that correspond to a future time bucket are a ticking time bomb. You'll need to step in and deal with them before they cause issues.
If you're never expecting to insert rows before or far beyond your retention window, it might be best to not have a default table. That will make the most noise in these exceptional circumstances, which may be desirable, as your system is likely in a bad state.
Roll your own
While pg_cron and pg_partman are great solutions, sometimes you want more control over your retention strategy. Rolling your own solution can make sense when you want to:
- Keep retention logic in your application layer
- Avoid the complexity of pg_partman
- Plug into your existing monitoring and alerting
You can roll your own flavor of pg_partman with a background worker in your application. I'll show you what this looks like in Elixir. We'll use Oban, a job processing library for Elixir:
The table
The parent table can look like this:
create table events (
  id bigserial,
  inserted_at timestamp with time zone not null,
  payload jsonb,
  -- other columns...
  primary key (id, inserted_at)
) partition by range (inserted_at);
The background worker
The background worker will need to:
- Assess which partitions exist
- Drop partitions that are older than your retention window
- Create new partitions as needed
First, create a module to handle partition management:
defmodule MyApp.Partitions do
  @moduledoc """
  Manages partitioned tables in the database.
  """

  alias MyApp.Repo
  @doc """
  Formats a date into the pg_partman style partition suffix.
  Example: 2024-10-29 -> "p20241029"
  """
  def format_partition_name(date) do
    "p" <> Calendar.strftime(date, "%Y%m%d")
  end

  @doc """
  Parses a date from a pg_partman style partition name.
  Example: "events_p20241029" -> ~D[2024-10-29]
  """
  def parse_partition_date(partition_name) do
    [date_str] =
      ~r/_p(\d{8})$/
      |> Regex.run(partition_name, capture: :all_but_first)

    <<year::binary-size(4), month::binary-size(2), day::binary-size(2)>> = date_str
    Date.from_iso8601!(year <> "-" <> month <> "-" <> day)
  end
  @doc """
  Lists all partitions for a given table.
  Returns a list of tuples with the partition name and the date the partition corresponds to.
  """
  def list_partitions(table_name) do
    query = """
    select child.relname as partition_name
    from pg_inherits
    join pg_class parent on pg_inherits.inhparent = parent.oid
    join pg_class child on pg_inherits.inhrelid = child.oid
    where parent.relname = $1
    and child.relkind = 'r'
    order by child.relname;
    """

    with {:ok, %{rows: rows}} <- Repo.query(query, [table_name]) do
      partitions =
        Enum.map(rows, fn [name] ->
          {name, parse_partition_date(name)}
        end)

      {:ok, partitions}
    end
  end

  @doc """
  Creates a partition for the given date if it doesn't exist
  """
  def create_partition(table, date) do
    partition_name = "#{table}_#{format_partition_name(date)}"
    start_time = partition_start(date)
    end_time = partition_end(date)

    query = """
    create table if not exists #{partition_name}
    partition of #{table}
    for values from ('#{start_time}'::timestamptz)
    to ('#{end_time}'::timestamptz)
    """

    with {:ok, _} <- Repo.query(query) do
      :ok
    end
  end

  @doc """
  Drops a partition table
  """
  def drop_partition(partition_name) do
    query = "drop table #{partition_name}"

    with {:ok, _} <- Repo.query(query) do
      :ok
    end
  end
  @doc """
  Returns the start of the day for the given date in UTC
  """
  def partition_start(date) do
    date
    |> DateTime.new!(Time.new!(0, 0, 0), "Etc/UTC")
    |> DateTime.truncate(:second)
  end

  @doc """
  Returns the end of the day for the given date in UTC
  """
  def partition_end(date) do
    date
    |> Date.add(1)
    |> partition_start()
  end
end
Then, you can create a background worker that uses this module to manage your partitions:
defmodule MyApp.PartitionMaintenanceWorker do
  use Oban.Worker

  alias MyApp.Partitions

  @retention_days 30
  @future_days 5

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"table_name" => table_name}}) do
    with :ok <- drop_old_partitions(table_name),
         :ok <- create_future_partitions(table_name) do
      :ok
    end
  end

  defp drop_old_partitions(table_name) do
    cutoff = Date.utc_today() |> Date.add(-@retention_days)

    with {:ok, partitions} <- Partitions.list_partitions(table_name) do
      partitions
      |> Enum.each(fn {name, date} ->
        if Date.compare(date, cutoff) == :lt do
          :ok = Partitions.drop_partition(name)
        end
      end)

      :ok
    end
  end

  defp create_future_partitions(table_name) do
    today = Date.utc_today()

    with {:ok, partitions} <- Partitions.list_partitions(table_name) do
      existing_partitions =
        partitions
        |> Enum.map(fn {_, date} -> date end)
        |> MapSet.new()

      today
      |> Date.range(Date.add(today, @future_days))
      |> Enum.reject(&MapSet.member?(existing_partitions, &1))
      |> Enum.each(fn date ->
        :ok = Partitions.create_partition(table_name, date)
      end)

      :ok
    end
  end
end
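To run the worker on a schedule, you can use Oban's cron plugin. Here's a sketch of what the config might look like; the 30-minute cadence and the events table name are assumptions you'd adjust for your application:

# config/config.exs (a sketch; cadence and table name are assumptions)
config :my_app, Oban,
  repo: MyApp.Repo,
  plugins: [
    {Oban.Plugins.Cron,
     crontab: [
       # run partition maintenance for the events table every 30 minutes
       {"*/30 * * * *", MyApp.PartitionMaintenanceWorker, args: %{"table_name" => "events"}}
     ]}
  ]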
Choosing a retention strategy
I'm a big fan of starting simple, and pg_cron's simplicity is hard to beat. It's easy to understand, easy to set up, and with Postgres 17's improved vacuum performance, it can handle surprisingly high volumes.
Or, instead of pg_cron, you can even use your existing job processing system to run deletes on a regular interval.
However, there are some maintenance considerations with cron-based deletes. With pg_cron, you'll want to keep an eye on cron.job_run_details to ensure jobs are running successfully and not taking too long. And your biggest risk is falling behind: if jobs are silently failing and a huge backlog accumulates, you may get yourself into a situation where the backlog is difficult to clear.
So, you'll want to feed stats from a cron-based system into your monitoring system so you don't lose visibility.
pg_partman is more sophisticated and far more efficient, but also more complicated. It guarantees your bottleneck will be your write throughput, not your delete throughput. Its performance is more predictable: dropping a partition takes roughly the same time whether it contains 1 row or 1 million rows.
If you view your database through a database client often, you can also see at a glance whether partitions are being dropped. And if you do accumulate a large backlog of partitions, manual intervention is quick and easy.
For monitoring, pg_partman works with another extension, pg_jobmon. You can also simply keep tabs on how many partitions you have.
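That can be as simple as a count over pg_inherits (a sketch, reusing the same join as the Elixir module above):

select count(*) as partition_count
from pg_inherits
join pg_class parent on pg_inherits.inhparent = parent.oid
where parent.relname = 'sequin_events';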
Finally, with pg_partman as inspiration, rolling your own solution is very feasible. The nice part is that you can easily integrate it into your existing monitoring, logging, and alerting. You also get full control over your retention logic and can manage migrations as needed with a system you understand. Of course, there's risk too: drop partition calls are not for the faint of heart!