
Benchmarking PG_NET: Part 1

Carter Pedersen
8 min read

With pg_net, Postgres databases can make asynchronous HTTP requests. Supabase uses pg_net under the hood to power Supabase Webhooks, making it one of the key technologies for reacting to changes in Supabase.
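
For context, a direct pg_net call looks roughly like this (the URL is a placeholder). net.http_post queues the request and returns a request id immediately; a background worker later records the outcome in net._http_response:

-- Queue an asynchronous POST; returns a request id right away.
SELECT net.http_post(
    url     := 'https://example.com/webhook',            -- placeholder URL
    body    := '{"event": "ping"}'::jsonb,
    headers := '{"Content-Type": "application/json"}'::jsonb
) AS request_id;

-- The background worker later writes the result here:
SELECT status_code, error_msg
FROM net._http_response
WHERE id = 1;  -- substitute the request id returned above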

I set out to benchmark pg_net to give the Supabase community the information needed to decide when pg_net will work for them and when they should evaluate other options. Turns out pg_net can be a fickle extension.

Along the way, I reproduced an issue where pg_net drops requests. Debugging led me down a rabbit hole of ephemeral port exhaustion and, eventually, into compiler land.

Additional benchmark scenarios will be covered in future posts. Subscribe to be notified.

Methodology

For my testing I used the official Supabase Postgres Docker image, which comes with pg_net installed. At the time of writing, the latest versions were Postgres 15.6.1 and pg_net 0.10.1.

This benchmark runs locally. The webhook target is a simple Elixir Phoenix server, with a controller action that returns an empty 200 response immediately. This handler typically returns in ~400 µs:

def handle(conn, _params) do
  send_resp(conn, 200, "")
end

Benchmark scenarios

  1. Single insert performance
  2. Batch insert performance
  3. Batch insert performance with retries

Metrics to measure

For each scenario, I measured 2 metrics:

  • Throughput
  • Request success rate

Importantly, I also determined a throughput for which the success rate is reliably ~100%.

The bench

I used pgbench to perform transactions against the Postgres database:

pgbench -c 10 -j 2 -t 1000 -P 5 -h localhost -U postgres -f bench.sql postgres

This runs the bench.sql script 1000 times per client across 10 concurrent clients, for 10,000 transactions in total, simulating concurrent operations against the target table. The contents of bench.sql vary by scenario. For single inserts, we randomize some variables and perform an insert:

\set location_id random(1, 1000)
\set harvester_id random(1, 1000)
\set spice_quantity random(1000, 5000)
\set sandworm_encounters random(0, 5)
\set weather_id random(1, 5)

INSERT INTO spice_harvests (harvest_date, location, harvester_name, spice_quantity, sandworm_encounters, weather_condition)
VALUES (
    CURRENT_DATE, 
    'Location-' || :location_id, 
    'Harvester-' || :harvester_id, 
    :spice_quantity, 
    :sandworm_encounters, 
    (CASE :weather_id
        WHEN 1 THEN 'Sandstorm'
        WHEN 2 THEN 'Clear'
        WHEN 3 THEN 'Hot'
        WHEN 4 THEN 'Windy'
        ELSE 'Dusty'
    END)
);

I attached a function/trigger to the target table, spice_harvests, that initiates an API request via pg_net:

CREATE OR REPLACE FUNCTION spice_harvests_webhook()
    RETURNS TRIGGER
    LANGUAGE plpgsql
AS $$
BEGIN
    PERFORM net.http_post(
        url := 'http://host.docker.internal:4000/api/spice-harvest',
        -- serialize the inserted row as the JSON request body
        body := to_jsonb(NEW),
        headers := '{"Content-Type": "application/json"}'::jsonb
    );

    RETURN NEW;
END;
$$;

CREATE TRIGGER spice_harvests_webhook_trigger
    AFTER INSERT ON spice_harvests
    FOR EACH ROW
    EXECUTE FUNCTION spice_harvests_webhook();

pgbench typically finishes its run of 10,000 transactions in ~1 s. We then allow pg_net the time necessary to work through the net.http_request_queue table.
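
To tell when a run has fully drained, a quick check of the queue and response tables is enough. A minimal sketch, using the tables pg_net exposes:

-- Requests still waiting for the background worker
SELECT count(*) AS pending FROM net.http_request_queue;

-- Requests that have completed (successfully or otherwise)
SELECT count(*) AS completed FROM net._http_response;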

First issues

I started my first bench with pg_net configured with all the defaults. Immediately, I hit concerningly low throughput and success rate figures:

➜  pg_net git:(main) ✗ go run main.go

Test Results:
=============
Success Rate: 94.72%
Throughput: 74.31 requests/second

Failed Requests by Status Code:
-------------------------------
| Status     | Count      |
|------------|------------|
| NULL       | 528        |

Failed Requests by Error Message:
---------------------------------
| Error                          | Count      |
|--------------------------------|------------|
| Timeout was reached            | 80         |
| Couldn't connect to server     | 53         |
| Server returned nothing (no... | 395        |

Over 5% of requests failed. On top of that, pg_net was only able to send <75 reqs/second. These results were far below the success rate and throughput that pg_net claims to hit, which is roughly 100% delivery at 1k/s.

As a sanity check, I wrote a quick script to verify that my server could handle the expected load. The Go script asynchronously sends HTTP POST requests to our webhook handler and checks the response code. The results show that the server is OK:

➜  pg_net git:(main) ✗ go run server_verification.go
Progress: 10000/10000 (100.00%) | Success Rate: 100.00%
Total requests: 10000
Successes: 10000
Failures: 0
Success Rate: 100.00%
Throughput: 1707.91 requests per second

This indicates that the pg_net failures originate somewhere between Postgres and the server, but not at the server itself.

To diagnose, I analyzed all the available error information from the pg_net tables:

SELECT
        count(id) AS count,
        status_code,
        error_msg
FROM
        net._http_response
GROUP BY
        status_code,
        error_msg
ORDER BY
        count DESC;

 count | status_code |                   error_msg
-------+-------------+-----------------------------------------------
  9691 |         200 |
   268 |             | Server returned nothing (no headers, no data)
    27 |             | Timeout was reached
    14 |             | Couldn't connect to server
(4 rows)

The failed requests all have null status codes and error messages that are either:

  • Timeout was reached
  • Couldn't connect to server
  • Server returned nothing (no headers, no data)

Checking the webhook handler's logs, I find that the Phoenix server recorded no timeouts or disconnect errors of any kind. The pg_net tables tell us no more.

The Postgres logs also contain no further information for debugging!

At this point I modified the spice_harvests_webhook() function to send empty payloads ('{}'::jsonb) to see if that would help. It did not.
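
Concretely, the experiment was just to swap the row payload for an empty object in the trigger function, roughly:

    -- send an empty JSON body instead of the serialized row
    PERFORM net.http_post(
        url     := 'http://host.docker.internal:4000/api/spice-harvest',
        body    := '{}'::jsonb,
        headers := '{"Content-Type": "application/json"}'::jsonb
    );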

So far, it’s difficult to pinpoint what in our pg_net or Postgres configuration is leading to the high failure rate and low throughput.

Lowering the batch size from the default of 200 to 100 does help:

➜  pg_net git:(main) ✗ go run main.go
Test Results:
=============
Success Rate: 99.15%
Throughput: 50.26 requests/second

Failed Requests by Status Code:
-------------------------------
| Status     | Count      |
|------------|------------|
| NULL       | 17         |

Failed Requests by Error Message:
---------------------------------
| Error                          | Count      |
|--------------------------------|------------|
| Timeout was reached            | 17         |

We still have nearly 1% of requests failing though. This time the errors are entirely timeout errors.

Halving again to a batch_size of 50 finally gets us to a 100% success rate:

➜  pg_net git:(main) ✗ go run main.go
Success Rate: 100.00%
Throughput: 97.36 requests/second
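
For reference, the batch size here is pg_net's pg_net.batch_size setting, which caps how many queued requests the background worker sends per iteration. A rough sketch of how I changed it (depending on the pg_net version, the worker may need a Postgres restart to pick up the new value):

-- Cap the number of queued requests processed per worker iteration.
ALTER SYSTEM SET pg_net.batch_size = 50;
SELECT pg_reload_conf();
-- If the running worker doesn't pick up the change, restart Postgres.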

Timeout was reached

pg_net gives us no more information on the root causes of the failed requests. To investigate further, I dug into the HTTP client library that pg_net uses under the hood: libcurl.

The timeout error_msg that ends up in the net._http_response table originates in libcurl. In other words, to diagnose why this error occurs in our setup, we need to determine why a low-level networking library is unable to fulfill a request within its allotted timeout.

We can reduce the incidence of this error by increasing the timeout_milliseconds parameter of the net.http_post call. Back at the default batch_size of 200, bumping the timeout to 30 seconds gets us up to a ~98% success rate and ~100 reqs/sec. But we still have a significant number of errors:

➜  pg_net git:(main) ✗ go run main.go
Test Results:
=============
Success Rate: 98.41%
Throughput: 98.93 requests/second

Failed Requests by Status Code:
-------------------------------
| Status     | Count      |
|------------|------------|
| NULL       | 159        |

Failed Requests by Error Message:
---------------------------------
| Error                          | Count      |
|--------------------------------|------------|
| Timeout was reached            | 13         |
| Couldn't connect to server     | 15         |
| Server returned nothing (no... | 131        |
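
For reference, the timeout bump is just the timeout_milliseconds argument to net.http_post, set in the trigger function roughly like so:

    PERFORM net.http_post(
        url                  := 'http://host.docker.internal:4000/api/spice-harvest',
        body                 := to_jsonb(NEW),
        headers              := '{"Content-Type": "application/json"}'::jsonb,
        -- allow libcurl up to 30 seconds per request
        timeout_milliseconds := 30000
    );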

The most common error at higher throughput is Server returned nothing (no headers, no data).

We can see in the libcurl documentation that this error occurs when the connection is closed:

Returns **CURLE_OK** if everything is okay, and a non-zero number for
errors. Returns **CURLE_GOT_NOTHING** if the associated connection is
closed.

So the issue runs deeper even than libcurl. Time to dive into the networking layer.

Ephemeral port exhaustion

Using netstat to check all TCP connections on port 4000 (the target port of our webhook server), I find 1000+ connections in TIME_WAIT:

➜  pg_net git:(main) ✗ netstat -anp tcp | grep 4000
tcp4       0      0  127.0.0.1.4000         *.*                    LISTEN
tcp4       0      0  127.0.0.1.59696        127.0.0.1.4000         TIME_WAIT
tcp4       0      0  127.0.0.1.59656        127.0.0.1.4000         TIME_WAIT

....

tcp4       0      0  127.0.0.1.59669        127.0.0.1.4000         TIME_WAIT
tcp4       0      0  127.0.0.1.59591        127.0.0.1.4000         TIME_WAIT
tcp4       0      0  127.0.0.1.59594        127.0.0.1.4000         TIME_WAIT

A high number of TCP connections in the TIME_WAIT state between the Postgres server (where pg_net is running) and the HTTP server on port 4000 indicates ephemeral port exhaustion.

These errors occur because the operating system has run out of available ephemeral ports to establish new TCP connections, causing new connection attempts to fail.

Each TCP connection goes through several states during its lifecycle. After a connection is closed, it enters the TIME_WAIT state to ensure all packets have been properly received and to prevent delayed packets from interfering with new connections.

With a high rate of connections being opened and closed in the benchmark, the number of sockets in TIME_WAIT can quickly accumulate, consuming the pool of available ephemeral ports.

I can tweak the operating system to avoid ephemeral port exhaustion with a few settings changes. On macOS, I adjusted several inet/tcp sysctl settings to widen the ephemeral port range and let ports in TIME_WAIT become available for reuse more quickly:

sudo sysctl -w net.inet.tcp.msl=1000
net.inet.tcp.msl: 15000 -> 1000

sudo sysctl -w net.inet.ip.portrange.hifirst=32768
net.inet.ip.portrange.hifirst: 49152 -> 32768

sudo sysctl -w net.inet.tcp.keepidle=10000
net.inet.tcp.keepidle: 7200000 -> 10000

sudo sysctl -w net.inet.tcp.keepintvl=5000
net.inet.tcp.keepintvl: 75000 -> 5000

sudo sysctl -w net.inet.tcp.keepcnt=5
net.inet.tcp.keepcnt: 8 -> 5

Unfortunately, this barely helped performance.

=============
Success Rate: 93.87%
Throughput: 42.83 requests/second

Failed Requests by Status Code:
-------------------------------
| Status     | Count      |
|------------|------------|
| NULL       | 613        |

Failed Requests by Error Message:
---------------------------------
| Error                          | Count      |
|--------------------------------|------------|
| Timeout was reached            | 13         |
| Couldn't connect to server     | 4          |
| Server returned nothing (no... | 596        |

Docker?

Let’s try pulling the Postgres + pg_net setup out of Docker to see if that helps the situation.

To do so, I run Postgres on bare metal, installed via Homebrew on macOS. I have to build pg_net from source and enable it in postgresql.conf so the extension is available.

The pg_net repo has instructions to build from source. My first pass produced a clang error:

➜  pg_net git:(master) make
clang -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -Wno-deprecated-non-prototype -O2 -std=c11 -Werror -Wno-declaration-after-statement   -DEXTVERSION=\"0.10.0\" -I. -I./ -I/opt/homebrew/include/postgresql@14/server -I/opt/homebrew/include/postgresql@14/internal -I/opt/homebrew/Cellar/icu4c/74.2/include -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk -I/opt/homebrew/opt/openssl@3/include -I/opt/homebrew/opt/readline/include -I/opt/homebrew/Cellar/lz4/1.9.4/include   -c -o src/worker.o src/worker.c
clang: error: no such sysroot directory: '/Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk' [-Werror,-Wmissing-sysroot]
make: *** [src/worker.o] Error 1

Claude suggests I install Apple's Command Line Tools:

➜  pg_net git:(main) ✗ xcode-select --install

This resolves the build issue. Finally, I add pg_net to shared_preload_libraries, restart Postgres, and enable the extension. All these steps were previously handled by the Supabase Docker image.

SHOW config_file;
...
nvim /var/lib/postgresql/data/postgresql.conf
...
shared_preload_libraries = 'pg_net'
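
After restarting Postgres, enabling the extension itself is a single statement (this assumes make install placed the extension files where this Postgres installation can find them):

CREATE EXTENSION IF NOT EXISTS pg_net;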

Success! Running bare-metal Postgres with pg_net gets us to the promised ~1k/sec throughput at a 100% success rate:

➜  pg_net git:(main) ✗ go run main.go
2024/09/12 18:31:11 Successfully connected to the database
2024/09/12 18:31:11 Tables truncated
2024/09/12 18:31:11 pgbench started
2024/09/12 18:31:27 Bench test completed
Test Results:
=============
Success Rate: 100.00%
Throughput: 937.19 requests/second

No failed requests by status code.

No failed requests by error message.

Conclusion

This benchmarking exercise has demonstrated that pg_net can indeed deliver the high throughput and reliability it promises—but only under the right conditions. When running pg_net on a bare-metal PostgreSQL installation, we achieved over 900 requests per second with a 100% success rate. This aligns with the performance metrics advertised by pg_net and shows its potential for handling high-load scenarios.

However, my journey to this point was far from straightforward. Using the out-of-the-box Supabase Docker image for local development led to significant performance issues, including a high rate of request failures and throughput far below expectations. These issues were compounded by the fact that pg_net is challenging to debug when something goes wrong. The lack of detailed error messages and the necessity to dive deep into system-level networking configurations—such as ephemeral port exhaustion—made troubleshooting a time-consuming endeavor.

This experience highlights a critical shortcoming: without robust retry mechanisms, you cannot guarantee that webhooks will reach their target HTTP endpoints. In production systems where reliability is paramount, relying solely on pg_net without implementing retries could lead to data inconsistencies or lost events.

In summary, while pg_net works as advertised in an optimal setup, developers should be cautious. The default Supabase Docker image may not be sufficient for reliable development or production use without significant tweaks and a deep understanding of the underlying systems. As I continue to benchmark pg_net with different batch sizes and larger payloads, I hope to uncover more insights into its capabilities and limitations.

Stay tuned for future posts where we'll dig deeper into batch insert performance and explore strategies to mitigate these challenges.