
Time-based retention strategies in Postgres

Anthony Accomazzo
@accomazzo
10 min read

We're Sequin, an open source message stream like Kafka, but built on Postgres. Building event-driven workflows is a common use case for Sequin, which means writing events to an event table with a TTL. This led us to explore strategies for time-based retention in Postgres.

Whether you're managing event logs, audit trails, or time-series data, there comes a point when you need to start pruning old records. Can Postgres do it? And what tools are at your disposal?

In this post, we'll explore three approaches to implementing retention policies in Postgres: using pg_cron for scheduled deletions, using pg_partman for partition-based retention, and rolling your own partition-based solution.

Using pg_cron

pg_cron is a Postgres extension for running cron jobs. It's a simple way to run scheduled tasks. You can use it to delete old data from your tables on a regular interval.

Basic setup

First, enable the extension:

create extension pg_cron;

Then, you create a cron job to delete old data like this:

select cron.schedule('delete-old-events', '0 0 * * *', $$
    delete from events 
    where inserted_at < now() - interval '30 days'
$$);

This will run the delete statement on a daily basis at midnight.

If your table has higher throughput, you'll want to run delete statements more frequently. You'll also want to batch your deletes to mitigate locking issues.

Advanced setup

To loop your deletes, create a stored procedure. This lets you commit after each delete, reducing how long each batch holds locks on the table:

create or replace procedure cleanup_old_events() 
language plpgsql as $$
declare
    batch_size integer := 10000;
    deleted_count integer;
begin
    loop
        delete from events 
        where id in (
            select id 
            from events 
            where inserted_at < now() - interval '30 days'
            limit batch_size
            for update skip locked
        );
        
        get diagnostics deleted_count = row_count;
        
        commit;
        
        exit when deleted_count < batch_size;
    end loop;
end;
$$;

The procedure exits when the number of deleted rows is less than the batch size, indicating it has reached the end of the expired data. get diagnostics ... row_count captures the number of rows affected by the last statement. (PL/pgSQL's found variable won't work here: it's a boolean, not a row count.)

Then, you can run the procedure in a cron job like this:

select cron.schedule('delete-old-events', '*/10 * * * *', 'call cleanup_old_events()');

This will run the procedure every 10 minutes.

Monitoring

pg_cron logs to the cron.job_run_details table. You can query it for information about recent jobs, including their status, start_time, end_time, and return_message.
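
For example, to check on the last few runs:

select jobid, status, return_message, start_time, end_time
from cron.job_run_details
order by start_time desc
limit 20;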

Can it scale?

A common concern with pg_cron is that it will not be able to keep up with write volume or that it will impact performance.

Deletes need to visit every affected row and mark it for deletion, creating "dead tuples." This is work for the autovacuum process to do later to clean up the disk. So there's immediate pressure (scanning/marking) and delayed pressure (vacuum cleanup).

That said, pg_cron can get you surprisingly far, especially on Postgres 17, which has greatly improved vacuum performance.

I've done some basic benching, simulating a system that experiences 100M writes/day (≈1K writes/sec) on Postgres 17. On a 16 CPU/64GB machine, I was able to run pg_cron with batch deletes every 5 minutes without any issue. Delete queries took under 35ms to run and writes were not impacted. The vacuum seemed able to easily keep up.

My simulation was far from perfect and had very little read volume. But it did suggest that a basic pg_cron retention strategy may be sufficient for higher-volume systems.

Using pg_partman

Compared to running DELETE statements, using partitions to enforce a retention policy is much more efficient. If you have a table sequin_events, you can partition it by day like so:

sequin_events
  ├── sequin_events_p20241025
  ├── sequin_events_p20241024
  ├── sequin_events_p20241023
  ├── sequin_events_p20241022
  └── sequin_events_p20241021

Then, to drop all data from a given day, you need only drop its partition. As a metadata operation, this can remove millions of rows by just updating the table's partition definition and deallocating the underlying files. It's more like changing a pointer in the filesystem than touching the actual data.
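
For example, to drop or detach the oldest partition:

-- dropping the partition discards all of its rows in one fast operation
drop table sequin_events_p20241021;

-- or, detach it instead to keep the data around as a standalone table
alter table sequin_events detach partition sequin_events_p20241021;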

pg_partman is a Postgres extension for creating and managing partitions. It comes with a background worker that will create new partitions and drop (or archive) old partitions for you.

In my opinion, it's a bit complicated to set up. There are a number of configuration options, and lots of new concepts. And there's a lot to be mindful of when it comes to maintenance and operation.

However, it's a nifty tool. What follows is a glide path for using pg_partman for enforcing a time-based retention policy:

Setting up pg_partman

First, you'll need to make sure the table you want to create a retention policy on is partitioned by range:

create table public.sequin_events (
  id bigserial,
  inserted_at timestamp with time zone not null default now(),
  -- ...
  primary key (id, inserted_at)
) partition by range (inserted_at);

After installing and setting up pg_partman, you can initialize it in your database:

create schema partman;
create extension pg_partman schema partman;

There are four table types in a pg_partman setup:

  • The parent table is the primary table you're going to partition. It's the table you run your insert and select statements against. Then, your queries will be routed to the appropriate partition table. (This is standard Postgres stuff.)
  • Partition tables are standard Postgres partitions. If you have a 30 day retention policy, there will be ~30 partition tables, one for each day.
  • The template table is a table based on your parent table that pg_partman uses to generate the schema for partition tables. Template tables let you have some drift in the schema between your parent table and its partition tables. Most likely, you don't need any drift, so this will just point to your parent table.
  • The default table is a partition table that rows are routed to if they don't match any of the existing partitions. We'll talk about this more in a moment.

So, to set up pg_partman on your parent table, run this command:

select partman.create_parent(
  p_parent_table := 'public.sequin_events',
  p_control := 'inserted_at',
  p_interval := '1 day',
  p_automatic_maintenance := 'on',
  p_default_table := false,
  p_template_table := 'public.sequin_events'
);

Here are the options you provide:

  • p_control: This is the column to partition by.
  • p_interval: This is a Postgres interval value specifying the intervals for your partitions.
  • p_automatic_maintenance: Uses the pg_partman background worker to perform maintenance (i.e. create and drop partitions). (Be sure to follow README instructions to set this up.)
  • p_default_table: This turns the default table off.
  • p_template_table: We specify that the parent table is the template for partitions, i.e. there is no difference in the schema between the parent and its children.

This function call:

  1. Creates a row in partman.part_config. This is the table the maintenance worker references when it wakes up to figure out what partitions it needs to manage and how to manage them.
  2. Creates the default table.
  3. Runs a maintenance loop, which will create your initial set of partitions.

You'll notice the partition tables in your database now:

sequin_events
  ├── sequin_events_p20241025
  ├── sequin_events_p20241024
  ├── sequin_events_p20241023
  ├── sequin_events_p20241022
  └── sequin_events_p20241021

Note that the create call does not specify a retention policy. Strangely, create_parent does not accept all possible configuration options. So, next, you need to update the configuration you just created with your retention policy:

update partman.part_config
set
  retention = '1 day',
  retention_keep_table = false,
  infinite_time_partitions = true
where parent_table = 'public.sequin_events';

If you'd prefer to have pg_partman keep old partitions and just detach them from the parent table, that's possible: leave retention_keep_table at its default of true.
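
A detached partition is still a regular table, so you can query it directly or re-attach it later. A sketch, assuming the partition's original day boundaries:

alter table sequin_events
attach partition sequin_events_p20241021
for values from ('2024-10-21 00:00:00+00') to ('2024-10-22 00:00:00+00');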

In my testing, it seemed necessary to set infinite_time_partitions to true in order to get pg_partman to correctly create future partitions (default is false). It's admittedly not clear to me why. Every other guide and tutorial I've seen also sets it to true with no explanation.

With that, your table is set up for time-based retention.

For testing purposes, you can use much lower p_interval and retention values to see data get cycled faster. You'll just want to be sure to set the pg_partman_bgw.interval value in your postgresql.conf to a lower value too, so the background worker runs more frequently. See the README.
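
For example, in postgresql.conf (assuming you've already added the background worker per the README):

shared_preload_libraries = 'pg_partman_bgw'
pg_partman_bgw.interval = 60           # run maintenance every 60 seconds (default is 3600)
pg_partman_bgw.dbname = 'your_database'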

The default table

Previously, I recommended configuring your setup without a default table. As you'll see, there isn't really a "right" answer here:

The default table (or default partition) is where rows are routed if they don't fit the criteria for the other partitions.

Let's say you have a retention policy of 5 days. So you have 10 partition tables: one for each of the previous 5 days and one for each of the next 5 days. Maybe there was a bug in your system, and now you're trying to insert a row from 6 days ago. But because there is no partition for that row to route to, you'll get an error:

ERROR:  no partition of relation "sequin_events" found for row
DETAIL:  Partition key of the failing row contains (inserted_at) = (2024-10-27 07:24:13.524879).

If you had a default table, the insert would have succeeded, as the row would have been routed to the default partition.

However, this escape hatch comes with tradeoffs. The tradeoff depends on whether the row is "too old," arriving before the retention window, or "too far in the future," arriving ahead of the buckets that pg_partman "premakes" in advance.

Rows that fall outside the retention window because they're too old will be inserted into the default table and retained indefinitely. That's because pg_partman does not apply a retention policy to the default table.

So, if you choose to have a default table, you'll want to set up some monitoring and alerting so you can manually intervene and deal with those rows.
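
A simple check to alert on, assuming pg_partman's default naming for the default partition:

-- any rows here fell outside the existing partitions and need manual attention
select count(*) from sequin_events_default;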

Alternatively, without a default table, you could just discard those rows as they come in. But, of course, these rows are likely symptomatic of a broader issue in your system that should be addressed.

For rows that fall outside of the retention window because they're too far in the future, things are even more complicated. If they're inserted into the default table, when pg_partman goes to make a partition for the day corresponding to this future row, it will fail, as the default table contains a row that matches this partition. We can't create the new partition and rebalance rows from default into it in one go.

So, rows inserted into the default table that correspond to a future time bucket are like a ticking time bomb. You'll need to step in and deal with them before they cause issues.
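
pg_partman ships with tooling for this. A sketch using its partition_data_proc procedure, which moves rows out of the default table into the proper child partitions, creating them as needed:

call partman.partition_data_proc('public.sequin_events');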

If you're never expecting to insert rows before or far beyond your retention window, it might be best to not have a default table. That will make the most noise in these exceptional circumstances, which may be desirable, as your system is likely in a bad state.

Roll your own

While pg_cron and pg_partman are great solutions, sometimes you want more control over your retention strategy. Rolling your own solution can make sense when you want to:

  • Keep retention logic in your application layer
  • Avoid the complexity of pg_partman
  • Plug into your existing monitoring and alerting

You can roll your own flavor of pg_partman with a background worker in your application. I'll show you what this looks like in Elixir. We'll use Oban, a job processing library for Elixir:

The table

The parent table can look like this:

create table events (
  id bigserial,
  inserted_at timestamp with time zone not null,
  payload jsonb,
  -- other columns...
  primary key (id, inserted_at)
) partition by range (inserted_at);

The background worker

The background worker will need to:

  • Assess which partitions exist
  • Drop partitions that are older than your retention window
  • Create new partitions as needed

First, create a module to handle partition management:

defmodule MyApp.Partitions do
  @moduledoc """
  Manages partitioned tables in the database.
  """

  import Ecto.Query
  alias MyApp.Repo

  @doc """
  Formats a date into the pg_partman style partition suffix.
  Example: 2024-10-29 -> "p20241029"
  """
  def format_partition_name(date) do
    "p" <> Calendar.strftime(date, "%Y%m%d")
  end

  @doc """
  Parses a date from a pg_partman style partition name.
  Example: "events_p20241029" -> ~D[2024-10-29]
  """
  def parse_partition_date(partition_name) do
    [date_str] =
      ~r/_p(\d{8})$/
      |> Regex.run(partition_name, capture: :all_but_first)

    <<year::binary-size(4), month::binary-size(2), day::binary-size(2)>> = date_str
    Date.from_iso8601!(year <> "-" <> month <> "-" <> day)
  end

  @doc """
  Lists all partitions for a given table.

  Returns a list of tuples with the partition name and the date the partition corresponds to.
  """
  def list_partitions(table_name) do
    query = """
    select child.relname as partition_name
    from pg_inherits
      join pg_class parent on pg_inherits.inhparent = parent.oid
      join pg_class child on pg_inherits.inhrelid = child.oid
    where parent.relname = $1
      and child.relkind = 'r'
    order by child.relname;
    """

    with {:ok, %{rows: rows}} <- Repo.query(query, [table_name]) do
      partitions =
        Enum.map(rows, fn [name] ->
          {name, parse_partition_date(name)}
        end)

      {:ok, partitions}
    end
  end

  @doc """
  Creates a partition for the given date if it doesn't exist
  """
  def create_partition(table, date) do
    partition_name = "#{table}_#{format_partition_name(date)}"
    start_time = partition_start(date)
    end_time = partition_end(date)
    
    query = """
    create table #{partition_name}
    partition of #{table}
    for values from ('#{start_time}'::timestamptz)
    to ('#{end_time}'::timestamptz)
    """
    
    with {:ok, _} <- Repo.query(query) do
      :ok
    end
  end
  
  @doc """
  Drops a partition table
  """
  def drop_partition(partition_name) do
    query = "drop table #{partition_name}"
    
    with {:ok, _} <- Repo.query(query) do
      :ok
    end
  end

   @doc """
  Returns the start of the day for the given date in UTC
  """
  def partition_start(date) do
    date
    |> DateTime.new!(Time.new!(0, 0, 0), "Etc/UTC")
    |> DateTime.truncate(:second)
  end
  
  @doc """
  Returns the end of the day for the given date in UTC
  """
  def partition_end(date) do
    date
    |> Date.add(1)
    |> partition_start()
  end
end

Then, you can create a background worker that uses this module to manage your partitions:

defmodule MyApp.PartitionMaintenanceWorker do
  use Oban.Worker

  alias MyApp.Partitions
  
  @retention_days 30
  @future_days 5
  
  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"table_name" => table_name}}) do
    with :ok <- drop_old_partitions(table_name),
         :ok <- create_future_partitions(table_name) do
      :ok
    end
  end
  
  defp drop_old_partitions(table_name) do
    cutoff = Date.utc_today() |> Date.add(-@retention_days)
    
    with {:ok, partitions} <- Partitions.list_partitions(table_name) do
      partitions
      |> Enum.each(fn {name, date} ->
        if Date.compare(date, cutoff) == :lt do
          :ok = Partitions.drop_partition(name)
        end
      end)

      :ok
    end
  end
  
  defp create_future_partitions(table_name) do
    today = Date.utc_today()
    
    with {:ok, partitions} <- Partitions.list_partitions(table_name) do
      existing_partitions =
        partitions
        |> Enum.map(fn {_, date} -> date end)
        |> MapSet.new()
    
      today
      |> Date.range(Date.add(today, @future_days))
      |> Enum.reject(&MapSet.member?(existing_partitions, &1))
      |> Enum.each(fn date ->
        :ok = Partitions.create_partition(table_name, date)
      end)

      :ok
    end
  end
end
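
To run the worker on a schedule, you can use Oban's cron plugin. A minimal sketch, assuming your application is named :my_app:

config :my_app, Oban,
  repo: MyApp.Repo,
  plugins: [
    {Oban.Plugins.Cron,
     crontab: [
       # run partition maintenance for the events table every 10 minutes
       {"*/10 * * * *", MyApp.PartitionMaintenanceWorker, args: %{table_name: "events"}}
     ]}
  ]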

Choosing a retention strategy

I'm a big fan of starting simple, and pg_cron's simplicity is hard to beat. It's easy to understand, easy to set up, and with Postgres 17's improved vacuum performance, it can handle surprisingly high volumes.

Or, instead of pg_cron, you can even use your existing job processing system to run deletes on a regular interval.

However, there are some maintenance considerations with cron-based deletes. With pg_cron, you'll want to keep an eye on cron.job_run_details to ensure jobs are running successfully and not taking too long. And your biggest risk is falling behind. If jobs are silently failing and a huge backlog accumulates, you may get yourself into a situation where the backlog is difficult to clear.

So, you'll want to feed stats from a cron-based system into your monitoring system so you don't lose visibility.

pg_partman is more sophisticated and far more efficient, but also more complicated. It guarantees your bottleneck will be your write throughput, not your delete throughput. Its performance is more predictable – dropping a partition takes roughly the same time whether it contains 1 row or 1 million rows.

If you regularly view your database in a database client, you can also see whether partitions are being dropped. And if you accumulate a large backlog of partitions, you can intervene manually quickly and easily.

For monitoring, pg_partman works with another extension, pg_jobmon. You can also simply keep tabs on how many partitions you have.
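
For example, a partition count you could feed into your metrics:

select count(*) as partition_count
from pg_inherits
join pg_class parent on pg_inherits.inhparent = parent.oid
where parent.relname = 'sequin_events';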

Finally, with pg_partman as inspiration, rolling your own solution is very feasible. The nice part is that you can easily integrate it into your existing monitoring, logging, and alerting. You also get full control over your retention logic, and can manage migrations as needed with a system you understand. Of course, there's risk too: drop partition calls are not for the faint of heart!