With pg_net, Postgres databases can make asynchronous HTTP requests. Supabase uses pg_net under the hood to power Supabase Webhooks, making it one of the key technologies for reacting to changes in Supabase.
I set out to benchmark pg_net to give the Supabase community the information needed to decide when pg_net will work for them and when they should evaluate other options. Turns out pg_net can be a fickle extension.
Along the way, I reproduce an issue where pg_net drops requests. Debugging led me down a rabbit hole on ephemeral port exhaustion and eventually into C compiler land.
Additional benchmark scenarios will be covered in future posts. Subscribe to be notified.
Methodology
For my testing I use the official Supabase Postgres Docker image, which comes with pg_net installed. I used the latest versions at the time of writing: Postgres 15.6.1 and pg_net 0.10.1.
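To confirm which versions are in play, a quick check from psql works (the extension is already created in the Supabase image, so this is read-only):
-- Server version and installed pg_net extension version
SELECT version();
SELECT extversion FROM pg_extension WHERE extname = 'pg_net';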
This benchmark runs locally. The webhook target is a simple Elixir Phoenix server whose controller action simulates processing by returning a 200 response immediately. Typically this handler returns in ~400 µs:
def handle(conn, _params) do
send_resp(conn, 200, "")
end
Benchmark scenarios
- Single insert performance
- Batch insert performance
- Batch insert performance with retries
Metrics to measure
For each scenario, I measured 2 metrics:
- Throughput
- Request success rate
Importantly, I also determined the throughput at which the success rate is reliably ~100%.
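The success rate falls out of pg_net's own response log. A minimal sketch of the kind of query involved, assuming all responses from a run are still present in net._http_response (pg_net only retains responses for a limited window, 6 hours by default as I understand it):
-- Share of completed requests that returned a 200
SELECT
  count(*) FILTER (WHERE status_code = 200)::numeric / NULLIF(count(*), 0) AS success_rate,
  count(*) AS total_responses
FROM net._http_response;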
The bench
I used pgbench to perform transactions against the Postgres database:
pgbench -c 10 -j 2 -t 1000 -P 5 -h localhost -U postgres -f bench.sql postgres
This runs the bench.sql script 1,000 times per client across the 10 connected clients, for 10,000 transactions in total, simulating concurrent operations against your target table. The contents of bench.sql vary between scenarios. For single inserts we randomize some variables and perform an insert:
\set location_id random(1, 1000)
\set harvester_id random(1, 1000)
\set spice_quantity random(1000, 5000)
\set sandworm_encounters random(0, 5)
\set weather_id random(1, 5)
INSERT INTO spice_harvests (harvest_date, location, harvester_name, spice_quantity, sandworm_encounters, weather_condition)
VALUES (
CURRENT_DATE,
'Location-' || :location_id,
'Harvester-' || :harvester_id,
:spice_quantity,
:sandworm_encounters,
(CASE :weather_id
WHEN 1 THEN 'Sandstorm'
WHEN 2 THEN 'Clear'
WHEN 3 THEN 'Hot'
WHEN 4 THEN 'Windy'
ELSE 'Dusty'
END)
);
I attached a function/trigger to the target table, spice_harvests, that initiates an API request via pg_net:
CREATE OR REPLACE FUNCTION spice_harvests_webhook()
RETURNS TRIGGER
LANGUAGE plpgsql
AS $$
BEGIN
PERFORM net.http_post(
url := 'http://host.docker.internal:4000/api/spice-harvest',
body := to_jsonb(NEW),
headers := '{"Content-Type": "application/json"}'::jsonb
);
RETURN NEW;
END;
$$;
CREATE TRIGGER spice_harvests_webhook_trigger
AFTER INSERT ON spice_harvests
FOR EACH ROW
EXECUTE FUNCTION spice_harvests_webhook();
pgbench typically finishes the 10,000 transactions in about a second. We then allow pg_net the time necessary to work through the net.http_request_queue table.
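While the worker drains the queue, the backlog is easy to watch with a quick query against the two pg_net tables:
-- Requests still waiting to be sent vs. responses already recorded
SELECT
  (SELECT count(*) FROM net.http_request_queue) AS pending_requests,
  (SELECT count(*) FROM net._http_response) AS recorded_responses;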
First issues
I started my first bench with pg_net configured with all the defaults. Immediately, I hit concerningly low throughput and success rate figures:
➜ pg_net git:(main) ✗ go run main.go
Test Results:
=============
Success Rate: 94.72%
Throughput: 74.31 requests/second
Failed Requests by Status Code:
-------------------------------
| Status | Count |
|------------|------------|
| NULL | 528 |
Failed Requests by Error Message:
---------------------------------
| Error | Count |
|--------------------------------|------------|
| Timeout was reached | 80 |
| Couldn't connect to server | 53 |
| Server returned nothing (no... | 395 |
Over 5% of requests failed. On top of that, pg_net was only able to send fewer than 75 requests/second. These results were far below the success rate and throughput that pg_net claims to hit, which is roughly 100% delivery at 1k requests/second.
As a sanity check, I wrote a quick script to verify that my server could handle the expected load. The Go script asynchronously sends HTTP posts to our webhook handler and checks the response code. The results show that the server is OK:
➜ pg_net git:(main) ✗ go run server_verification.go
Progress: 10000/10000 (100.00%) | Success Rate: 100.00%
Total requests: 10000
Successes: 10000
Failures: 0
Success Rate: 100.00%
Throughput: 1707.91 requests per second
This indicates that the pg_net failures originate somewhere between Postgres and the server, not at the server itself.
To diagnose, I analyzed all the available error information from the pg_net tables:
SELECT
count(id) AS count,
status_code,
error_msg
FROM
net._http_response
GROUP BY
status_code,
error_msg
ORDER BY
count DESC;
count | status_code | error_msg
-------+-------------+-----------------------------------------------
9691 | 200 |
268 | | Server returned nothing (no headers, no data)
27 | | Timeout was reached
14 | | Couldn't connect to server
(4 rows)
The failed requests all have null status codes and one of the following error messages:
- Timeout was reached
- Couldn't connect to server
- Server returned nothing (no headers, no data)
Checking the logs for the webhook handler, the Phoenix server shows no timeouts or disconnect error messages of any kind. The pg_net tables tell us no more.
The Postgres logs also contain no further information for debugging!
At this point I modified the spice_harvests_webhook() function to send empty payloads ('{}'::jsonb) to see if that would help. It did not.
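For reference, the stripped-down call inside the trigger function looked roughly like this:
-- Same endpoint and headers, but an empty body instead of the row payload
PERFORM net.http_post(
    url := 'http://host.docker.internal:4000/api/spice-harvest',
    body := '{}'::jsonb,
    headers := '{"Content-Type": "application/json"}'::jsonb
);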
So far, it’s difficult to diagnose what might be wrong with our pg_net or Postgres configuration that is leading to the high rate of failures and low throughput.
Lowering the batch size from the default of 200 to 100 does help the success rate:
➜ pg_net git:(main) ✗ go run main.go
Test Results:
=============
Success Rate: 99.15%
Throughput: 50.26 requests/second
Failed Requests by Status Code:
-------------------------------
| Status | Count |
|------------|------------|
| NULL | 17 |
Failed Requests by Error Message:
---------------------------------
| Error | Count |
|--------------------------------|------------|
| Timeout was reached | 17 |
We still have nearly 1% of requests failing, though, and this time the errors are entirely timeouts. Halving again to a batch_size of 50 finally gets us to a 100% success rate.
➜ pg_net git:(main) ✗ go run main.go
Success Rate: 100.00%
Throughput: 97.36 requests/second
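For completeness, the batch size is an ordinary Postgres setting (pg_net.batch_size) read by pg_net's background worker. A sketch of how to change it, assuming superuser access; depending on the pg_net version, the worker may need a restart before the new value takes effect:
-- Lower the number of queued requests the worker handles per iteration
ALTER SYSTEM SET pg_net.batch_size = 50;
SELECT pg_reload_conf();
-- Some pg_net versions also expose net.worker_restart() to apply it immediately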
Timeout was reached
pg_net gives us no more information on the root causes of the failed requests. To investigate further, I dug into the underlying HTTP driver in pg_net: libcurl.
The timeout error_msg that ends up in the net._http_response table originates in libcurl. In other words, to diagnose why this error occurs in our setup, we need to determine why a low-level networking library is unable to fulfill a request within its allotted timeout.
We can reduce the incidence of this error by increasing the timeout_milliseconds parameter of the net.http_post function call (the adjusted call is sketched after the results below). Back at the default batch_size of 200, with the timeout increased to 30 seconds, we are up to ~98% success rate and ~100 requests/second. But we still have a significant number of errors:
➜ pg_net git:(main) ✗ go run main.go
Test Results:
=============
Success Rate: 98.41%
Throughput: 98.93 requests/second
Failed Requests by Status Code:
-------------------------------
| Status | Count |
|------------|------------|
| NULL | 159 |
Failed Requests by Error Message:
---------------------------------
| Error | Count |
|--------------------------------|------------|
| Timeout was reached | 13 |
| Couldn't connect to server | 15 |
| Server returned nothing (no... | 131 |
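The adjusted trigger body referenced above, with the per-request timeout raised to 30 seconds (only the timeout_milliseconds argument changes from the original function):
-- Same webhook call as before, but allowing up to 30 seconds per request
PERFORM net.http_post(
    url := 'http://host.docker.internal:4000/api/spice-harvest',
    body := to_jsonb(NEW),
    headers := '{"Content-Type": "application/json"}'::jsonb,
    timeout_milliseconds := 30000
);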
The most common error at higher throughput is Server returned nothing (no headers, no data). The libcurl documentation shows that this error occurs when the associated connection is closed:
Returns **CURLE_OK** if everything is okay, and a non-zero number for
errors. Returns **CURLE_GOT_NOTHING** if the associated connection is
closed.
So the issue lies even deeper than libcurl. Time to dive into the networking layer.
Ephemeral port exhaustion
Using netstat to check all TCP connections on port 4000 (the target port of our webhook server), I find 1000+ connections in TIME_WAIT:
➜ pg_net git:(main) ✗ netstat -anp tcp | grep 4000
tcp4 0 0 127.0.0.1.4000 *.* LISTEN
tcp4 0 0 127.0.0.1.59696 127.0.0.1.4000 TIME_WAIT
tcp4 0 0 127.0.0.1.59656 127.0.0.1.4000 TIME_WAIT
....
tcp4 0 0 127.0.0.1.59669 127.0.0.1.4000 TIME_WAIT
tcp4 0 0 127.0.0.1.59591 127.0.0.1.4000 TIME_WAIT
tcp4 0 0 127.0.0.1.59594 127.0.0.1.4000 TIME_WAIT
A high number of TCP connections in the TIME_WAIT state between the Postgres server (where pg_net is running) and the HTTP server on port 4000 indicates ephemeral port exhaustion.
These errors occur because the operating system has run out of available ephemeral ports to establish new TCP connections, causing new connection attempts to fail.
Each TCP connection goes through several states during its lifecycle. After a connection is closed, it enters the TIME_WAIT state to ensure all packets have been properly received and to prevent delayed packets from interfering with new connections.
With a high rate of connections being opened and closed in the benchmark, the number of sockets in TIME_WAIT can quickly accumulate, consuming the pool of available ephemeral ports. As a rough back-of-envelope: macOS defaults to an MSL of 15 seconds, so a closed connection lingers in TIME_WAIT for 2 × MSL = 30 seconds, and with the default ephemeral range of roughly 16,000 ports (49152-65535) that caps sustained connection churn at around 500 new connections per second.
I can tweak my operating system to mitigate ephemeral port exhaustion with several settings changes. On macOS, I set several inet/tcp sysctls to widen the ephemeral port range and make ports in TIME_WAIT available for re-use more quickly:
sudo sysctl -w net.inet.tcp.msl=1000
net.inet.tcp.msl: 15000 -> 1000
sudo sysctl -w net.inet.ip.portrange.hifirst=32768
net.inet.ip.portrange.hifirst: 49152 -> 32768
sudo sysctl -w net.inet.tcp.keepidle=10000
net.inet.tcp.keepidle: 7200000 -> 10000
sudo sysctl -w net.inet.tcp.keepintvl=5000
net.inet.tcp.keepintvl: 75000 -> 5000
sudo sysctl -w net.inet.tcp.keepcnt=5
net.inet.tcp.keepcnt: 8 -> 5
Unfortunately, this barely helped performance.
=============
Success Rate: 93.87%
Throughput: 42.83 requests/second
Failed Requests by Status Code:
-------------------------------
| Status | Count |
|------------|------------|
| NULL | 613 |
Failed Requests by Error Message:
---------------------------------
| Error | Count |
|--------------------------------|------------|
| Timeout was reached | 13 |
| Couldn't connect to server | 4 |
| Server returned nothing (no... | 596 |
Docker?
Let’s try pulling the Postgres + pg_net setup out of Docker to see if that helps the situation. To do so, I run Postgres on bare metal, installed with brew on macOS. I have to build pg_net from source and enable it in postgresql.conf so the extension is available.
The pg_net repo has instructions to build from source. My first pass produced a clang error:
➜ pg_net git:(master) make
clang -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -Wno-deprecated-non-prototype -O2 -std=c11 -Werror -Wno-declaration-after-statement -DEXTVERSION=\"0.10.0\" -I. -I./ -I/opt/homebrew/include/postgresql@14/server -I/opt/homebrew/include/postgresql@14/internal -I/opt/homebrew/Cellar/icu4c/74.2/include -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk -I/opt/homebrew/opt/openssl@3/include -I/opt/homebrew/opt/readline/include -I/opt/homebrew/Cellar/lz4/1.9.4/include -c -o src/worker.o src/worker.c
clang: error: no such sysroot directory: '/Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk' [-Werror,-Wmissing-sysroot]
make: *** [src/worker.o] Error 1
Claude suggests I install Apple's developer command line tools:
➜ pg_net git:(main) ✗ xcode-select --install
This resolves the build issue. Finally, I add pg_net to shared_preload_libraries, restart Postgres, and enable the extension. All of these steps were previously handled by the Supabase Docker image.
SHOW config_file;
...
nvim /var/lib/postgresql/data/postgresql.conf
...
shared_preload_libraries = 'pg_net'
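With the library preloaded and Postgres restarted, enabling the extension in the target database is a one-liner, and the installed version can be checked the same way as before:
-- Create the extension after the restart and confirm the version
CREATE EXTENSION IF NOT EXISTS pg_net;
SELECT extversion FROM pg_extension WHERE extname = 'pg_net';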
Success! Running bare-metal Postgres with pg_net gets us to the promised ~1k requests/second throughput at a 100% success rate.
➜ pg_net git:(main) ✗ go run main.go
2024/09/12 18:31:11 Successfully connected to the database
2024/09/12 18:31:11 Tables truncated
2024/09/12 18:31:11 pgbench started
2024/09/12 18:31:27 Bench test completed
Test Results:
=============
Success Rate: 100.00%
Throughput: 937.19 requests/second
No failed requests by status code.
No failed requests by error message.
Conclusion
This benchmarking exercise has demonstrated that pg_net can indeed deliver the high throughput and reliability it promises, but only under the right conditions. When running pg_net on a bare-metal PostgreSQL installation, we achieved over 900 requests per second with a 100% success rate. This aligns with the performance metrics advertised by pg_net and shows its potential for handling high-load scenarios.
However, my journey to this point was far from straightforward. Using the out-of-the-box Supabase Docker image for local development led to significant performance issues, including a high rate of request failures and throughput far below expectations. These issues were compounded by the fact that pg_net is challenging to debug when something goes wrong. The lack of detailed error messages and the need to dive deep into system-level networking configuration, such as ephemeral port exhaustion, made troubleshooting a time-consuming endeavor.
This experience highlights a critical shortcoming: without robust retry mechanisms, you cannot guarantee that webhooks will reach their target HTTP endpoints. In production systems where reliability is paramount, relying solely on pg_net without implementing retries could lead to data inconsistencies or lost events.
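One pragmatic direction is to treat net._http_response as a delivery log and re-send anything that didn't come back with a success status. Because pg_net clears processed rows out of net.http_request_queue, the original payload has to be kept somewhere on the application side; the sketch below assumes a hypothetical spice_harvest_deliveries table that records each request id returned by net.http_post along with the payload that was sent:
-- Naive retry pass: re-send deliveries whose response is missing or non-2xx.
-- spice_harvest_deliveries(request_id, payload) is a hypothetical table the
-- application maintains itself. This also re-sends requests that are merely
-- still pending, so a real implementation would track attempts and age.
SELECT net.http_post(
    url := 'http://host.docker.internal:4000/api/spice-harvest',
    body := d.payload,
    headers := '{"Content-Type": "application/json"}'::jsonb
)
FROM spice_harvest_deliveries AS d
LEFT JOIN net._http_response AS r ON r.id = d.request_id
WHERE r.status_code IS NULL OR r.status_code >= 300;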
In summary, while pg_net works as advertised in an optimal setup, developers should be cautious. The default Supabase Docker image may not be sufficient for reliable development or production use without significant tweaks and a deep understanding of the underlying systems. As I continue to benchmark pg_net with different batch sizes and larger payloads, I hope to uncover more insights into its capabilities and limitations.
Stay tuned for future posts where we'll dig deeper into batch insert performance and explore strategies to mitigate these challenges.