technology – András Tornai

Finding, fixing, and verifying a relay I never chose

A follow-up to “The follow-up file”, where I argued that a fix isn’t done until you’ve verified that it moved a number. Here’s a case where I did.

The symptom

One Sentry error: negotiation-failed, one user, Safari/Mac, about 19 events in 25 minutes, then it recovered on its own. The kind of thing you write off as a bad network. I checked instead.

The numbers

My app (maguskartya.app) is a peer-to-peer card game over WebRTC, built on PeerJS – a small library that opens the browser-to-browser connections. I queried 90 days of production traffic – 20 players, 64 game sessions:

58% of sessions hit a connection error, a blocking reconnect overlay, or repeated join attempts.
41% showed the overlay; 75% of players were affected.
May looked the same. Not new, not rare.

The cause

I never set iceServers, so PeerJS used its default config. That default already includes a TURN relay – a free, shared, rate-limited public one (turn:eu-0.turn.peerjs.com:3478, credentials peerjs/peerjsp, plain TURN with no TLS). When a direct connection failed, players relayed through that shared box. Anyone behind a firewall that blocks port 3478 had no fallback at all.
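To see at a glance what an ICE configuration really offers, a small audit helper can be useful. The sketch below is a hypothetical helper (auditIceServers is my name for it, not part of PeerJS). It only parses the URL strings, and it assumes host:port URLs without IPv6 literals.

```typescript
// Hypothetical audit helper: plain string parsing, no PeerJS or browser APIs.
interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface IceAudit {
  url: string;
  kind: "stun" | "turn" | "turns" | "unknown";
  port: number | null;
  tls: boolean;
}

function auditIceServers(servers: IceServer[]): IceAudit[] {
  const out: IceAudit[] = [];
  for (const server of servers) {
    const urls = Array.isArray(server.urls) ? server.urls : [server.urls];
    for (const url of urls) {
      // ICE URLs look like "turn:host:port?transport=udp".
      const match = /^(stun|turns|turn):([^?]+)/.exec(url);
      const scheme = match?.[1] ?? "unknown";
      const hostPort = match?.[2] ?? "";
      // Assumption: no IPv6 literals, so the last ":" separates the port.
      const portText = hostPort.includes(":") ? hostPort.split(":").pop() : undefined;
      out.push({
        url,
        kind: scheme as IceAudit["kind"],
        port: portText ? Number(portText) : null,
        tls: scheme === "turns", // only "turns:" relays are wrapped in TLS
      });
    }
  }
  return out;
}

// The public relay named above, as example input.
const report = auditIceServers([
  { urls: "turn:eu-0.turn.peerjs.com:3478", username: "peerjs", credential: "peerjsp" },
]);
console.log(report);
```

A relay that shows up as kind "turn" on a single port with tls false is the case described here: one blocked port and there is nothing left to fall back to.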

So I wasn’t missing a TURN server. I was unknowingly depending on a bad one.

The fix

A dedicated relay – Cloudflare Realtime TURN. A small Vercel function (turn-credentials.ts) verifies the Clerk session, checks an allowlist claim, and mints a 24h credential. The client fetches it, caches it 23h, and uses it for every peer connection it opens.
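The client-side caching can be sketched as a small time-to-live cache. This is a minimal sketch, not the code from the app: TtlCache, getCredential and fetchFresh are hypothetical names, and fetchFresh stands in for the call to the credential endpoint. Only the 24h and 23h figures come from the setup above.

```typescript
// Credentials live 24h and the cache 23h, so the client refreshes
// before the credential it holds can expire.
const CACHE_TTL_MS = 23 * 60 * 60 * 1000;

class TtlCache<T> {
  private value: T | null = null;
  private storedAt = 0;
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns the cached value, or null once it is as old as the TTL.
  get(now: number): T | null {
    if (this.value === null) return null;
    return now - this.storedAt < this.ttlMs ? this.value : null;
  }

  set(value: T, now: number): void {
    this.value = value;
    this.storedAt = now;
  }
}

// Hypothetical wrapper: returns the cached credential or fetches a fresh one.
async function getCredential<T>(
  cache: TtlCache<T>,
  fetchFresh: () => Promise<T>,
  now: number = Date.now(),
): Promise<T> {
  const cached = cache.get(now);
  if (cached !== null) return cached;
  const fresh = await fetchFresh();
  cache.set(fresh, now);
  return fresh;
}
```

Passing the clock in as a parameter keeps the expiry logic testable without waiting 23 hours.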

The one trap worth knowing: PeerJS’s config replaces the defaults, it doesn’t merge. Pass your own iceServers and you silently lose the default STUN. Re-add it:

const turn = Array.isArray(data.iceServers) ? data.iceServers : [data.iceServers];
// peerjs's config REPLACES the defaults - re-add STUN or you lose it.
const servers = [{ urls: 'stun:stun.l.google.com:19302' }, ...turn];

There is also a deliberate fallback: if my credential endpoint is ever down and fetching a credential fails, the client doesn’t error out. It creates the peer with no custom config, so PeerJS falls back to its old default, the public relay. Players stay connected on a worse relay rather than not connecting at all.
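Put together, the STUN re-add and the fallback fit in one small function. This is a sketch under assumptions: buildPeerOptions is a hypothetical name, a failed fetch is represented as null, and the response shape is inferred from the snippet above.

```typescript
interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface PeerOptions {
  config?: { iceServers: IceServer[] };
}

const STUN: IceServer = { urls: "stun:stun.l.google.com:19302" };

// `data` is the credential endpoint's response body, or null if the fetch failed.
function buildPeerOptions(
  data: { iceServers: IceServer | IceServer[] } | null,
): PeerOptions {
  // No credential: return no config at all, so PeerJS keeps its defaults
  // (including the public relay) instead of getting an empty server list.
  if (data === null) return {};
  const turn = Array.isArray(data.iceServers) ? data.iceServers : [data.iceServers];
  // A custom config replaces the defaults, so STUN has to be listed explicitly.
  return { config: { iceServers: [STUN, ...turn] } };
}
```

The result would be passed as the options argument when constructing the PeerJS Peer. The important detail is that the failure path returns an object with no config key, not a config with an empty iceServers array.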

Did it work

Two matched 23-day windows, before and after the deploy, normalized per game so a drop means each game got smoother, not just that fewer people played:

Connection errors per game: 1.38 → 0.36 (-74%).
Reconnect overlays per game: 0.72 → 0.39 (-47%).
Games actually went up over the same period (130 → 161).
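The per-game normalization is simple arithmetic, but it is worth writing down because raw counts and per-game rates can tell different stories. A minimal sketch follows; the function names and the event counts in the example are hypothetical, not the production figures.

```typescript
interface WindowStats {
  events: number; // e.g. connection errors in the window
  games: number;  // games played in the same window
}

function perGame(w: WindowStats): number {
  if (w.games === 0) throw new Error("no games in window: the rate is undefined");
  return w.events / w.games;
}

// Relative change in percent, rounded to the nearest whole number.
function percentChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 100);
}

// Hypothetical windows: more games played after the deploy than before.
const before: WindowStats = { events: 60, games: 40 };
const after: WindowStats = { events: 30, games: 50 };

const rawChange = percentChange(before.events, after.events);     // raw count
const rateChange = percentChange(perGame(before), perGame(after)); // per game
console.log({ rawChange, rateChange });
```

With these made-up inputs the raw count falls by half while the per-game rate falls further, because the second window had more games. The per-game figure is the one that says each game got smoother.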

Direct check: 24% of established connections now relay through my own TURN – a quarter of connections genuinely need a relay, and now it’s mine, not the public one. Credential fetches succeed 100% of the time (202 fetches, 32 people).

Lessons

No iceServers in PeerJS doesn’t mean no relay – it means PeerJS’s shared public one. Find out how much of your traffic depends on it.
“Mostly works” isn’t a number. P2P failures self-heal, so a common problem looks like an edge case until you count it.
A meaningful fraction of users can’t connect directly and need a relay. If you didn’t set one up, you’re borrowing someone else’s.

Closing the loop between an anomaly and a verified fix

It has been a few years since the last post. The reason is the most ordinary one I could give: a small human moved into the house, evenings stopped being my own, and writing took a back seat. The nights are getting easier now, and a little of the old energy is back.

I have been working on a small side project – an online tabletop platform for an old Hungarian card game called M.A.G.U.S. (maguskartya.app). A few lessons have come out of it that I want to write down before they fade. This is the first: the gap between “I shipped a fix” and “I know the fix worked”.

Sentry tells you that something crashed. PostHog does not. PostHog tells you that something happened, and lets you look at the pattern – how often, in what context, and to whom. So when a rare-by-design guard suddenly fires 171 times across 13 users, you notice. Then you fix the cause. Then comes the part nobody talks about: verifying that the fix actually worked. Sentry closes that loop for you. PostHog doesn’t. This is the file I keep to close it manually. I call it FOLLOWUPS.md.

The two tools, drawn cleanly

A quick line on the split, because the rest of this post leans on it. Sentry is an error-monitoring product: it catches crashes – uncaught exceptions, unhandled promise rejections, things the user sees as “something went wrong”. When you ship a fix, the resolved error stays resolved; if it comes back, Sentry tells you. PostHog is a product-analytics tool (similar in role to Mixpanel or Amplitude): it tracks behaviour – the events you chose to instrument, the patterns you chose to watch. A signal in PostHog is almost always a rate: a thing that should be rare is happening more often than it should, and the job is to find what changed. Fixing the cause is half the work. Going back days later to confirm the rate actually dropped is the other half, and PostHog will not raise its hand for you.

A concrete example

The same project has a netaction_anomaly event – a tracker I added to log any time the receiver’s authority check on an incoming action fails. The game is peer-to-peer, so each client validates the actions it receives from peers; if a guest sends an action that should only come from the host, the check trips and the event fires. It is the kind of guard I expect to fire once a year, on a real attack or a real bug.
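A receiver-side guard of this kind might look roughly like the sketch below. The event and property names (netaction_anomaly, action_type, reason, sender_role) are the ones that appear in the PostHog data; everything else, including checkAuthority, the Role type and the HOST_ONLY set, is a hypothetical reconstruction and not the project’s code.

```typescript
type Role = "host" | "player" | "spectator";

interface NetAction {
  type: string;
  senderId: string;
}

interface Anomaly {
  event: "netaction_anomaly";
  action_type: string;
  reason: "host_only";
  sender_role: Role;
}

// Action types only the host may send (hypothetical list).
const HOST_ONLY = new Set<string>(["PARTICIPANT_UPDATE"]);

// Returns an anomaly record to track, or null if the action is allowed.
function checkAuthority(action: NetAction, roles: Map<string, Role>): Anomaly | null {
  // Assumption: a sender missing from the local room model counts as a spectator.
  const senderRole: Role = roles.get(action.senderId) ?? "spectator";
  if (HOST_ONLY.has(action.type) && senderRole !== "host") {
    return {
      event: "netaction_anomaly",
      action_type: action.type,
      reason: "host_only",
      sender_role: senderRole,
    };
  }
  return null;
}
```

The check only decides and describes. Whether the caller drops the action or merely tracks the anomaly is a separate choice, which keeps the guard easy to test.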

I wired it as a PostHog event rather than a Sentry warning. Both would work – Sentry can capture warnings without crashes – but PostHog gives me the rate and shape of any anomaly across users, not just an issue page.

PostHog showed me it had fired 171 times across 13 distinct users over the past six weeks. The problem was not that the guard fired – a single fire on a real attack or real bug would have been normal. The problem was the volume: a once-a-year guard does not fire 171 times in six weeks. The breakdown was clean: same action type, same reason: "host_only", same sender_role: "spectator". Every guest joining a game was tripping the guard. Nothing was broken from the user’s point of view – the game played fine – but the anomaly stream was now drowning in false positives, and any real anomaly would be lost in the noise.

[Screenshot: a single netaction_anomaly event in PostHog. The property breakdown is the diagnosis in one screen: action_type: PARTICIPANT_UPDATE, reason: host_only, sender_role: spectator. Every guest join produced the same shape.]
Caption: One of the 171 events. Person column redacted.

The diagnosis took an evening. The host broadcasts a “participants updated” message to everyone in the room (including the brand-new guest who just joined), and then immediately starts a per-peer sync to bring that new guest up to speed. The order was wrong: when the guest received the broadcast, its local model of who is in this room was still empty, so the sender (the host) looked like a spectator to the receiver’s authority check. The guard fired. The payload was applied anyway – I run guards in track-only mode, but that is a different post – so gameplay was unaffected. The data stream was the only thing that knew.

The fix was a one-line swap: do the per-peer sync first, then broadcast. PeerJS – a thin WebRTC-data-channel library this project uses for peer connections – preserves per-connection message order, so the joining guest now receives sync → broadcast in the correct sequence, and the guard does not fire.
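The race is easy to reproduce in a toy model. The sketch below replays the two messages on a fresh guest in both orders and counts how often the guard would trip. The message shapes and countAnomalies are hypothetical simplifications, and the model assumes messages arrive in the order they were sent.

```typescript
type Message =
  | { kind: "SYNC"; hostId: string }
  | { kind: "PARTICIPANT_UPDATE"; senderId: string };

// Replays messages in arrival order on a fresh guest and counts guard trips.
function countAnomalies(messages: Message[]): number {
  let knownHostId: string | null = null; // the guest's room model starts empty
  let anomalies = 0;
  for (const msg of messages) {
    if (msg.kind === "SYNC") {
      knownHostId = msg.hostId; // the sync tells the guest who the host is
    } else if (msg.senderId !== knownHostId) {
      anomalies += 1; // sender is not known as host yet: the guard fires
    }
  }
  return anomalies;
}

const sync: Message = { kind: "SYNC", hostId: "host-1" };
const broadcast: Message = { kind: "PARTICIPANT_UPDATE", senderId: "host-1" };

const oldOrder = countAnomalies([broadcast, sync]); // broadcast first: 1 anomaly
const newOrder = countAnomalies([sync, broadcast]); // sync first: 0 anomalies
console.log({ oldOrder, newOrder });
```

The fix only holds if delivery order matches send order on that connection, which is why the ordering guarantee matters as much as the swap itself.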

I committed the fix. And then I almost moved on.

The entry

Instead, I wrote one paragraph in a file called FOLLOWUPS.md. It looked like this (redacted):

## 1 - netaction_anomaly (PARTICIPANT_UPDATE / host_only / sender=spectator)
- Found via: PostHog query on <date>. 171 events across 13 users since the
  project started tracking. Fires on essentially every game session with
  a guest joining.
- Hypothesis: host-side join-message ordering race. <details>
- Fix applied: swapped call order in PeerService at <file:line>. Commit <hash>.
- Fix date: <date>
- Deploy date: <date>  (cutoff for verification, set conservatively to land
  after build + CDN propagation)
- Verify by: <date + 7 days>
- Verification query:
    SELECT count(), count(DISTINCT person_id)
    FROM events
    WHERE timestamp >= <deploy>
      AND event = 'netaction_anomaly'
      AND properties.action_type = 'PARTICIPANT_UPDATE'
      AND properties.reason = 'host_only'
      AND properties.sender_role = 'spectator';
- Expected outcome: 0 anomalies AND at least one fresh guest join in the
  window. If anomalies > 0, the ordering race is not the sole source.
- Resolution: pending


The same template, as a standalone file, is in the repo for this post.

The entry takes about ten minutes to write. The cost is small. The value is that on the verify-by date – which is on my calendar – I run the query and I know.

The sanity check – the part nobody tells you

Look at the “expected outcome” line. It has two clauses: zero anomalies AND at least one fresh guest join in the window. The second clause is the easy one to leave out.

If I run the query a week later and see zero anomalies, I might be looking at a working fix. Or I might be looking at a week where nobody used the affected feature. The data does not distinguish between “the fix worked” and “the bug had no opportunity to fire”. Without a paired traffic query, “zero” is meaningless.

Every entry in the file now has the sanity check baked in. If the success metric is “X dropped to zero”, there is always a second query asking “did the flow that produces X actually happen?” If both queries answer yes, the fix is verified. If only one does, I have a problem and a clue.
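The two-query rule reduces to a three-way decision. A minimal sketch, with hypothetical names: anomalyCount is the result of the verification query, flowCount is the result of the paired traffic query (here, fresh guest joins in the window).

```typescript
type Verdict = "verified" | "no_traffic" | "not_fixed";

function verdict(anomalyCount: number, flowCount: number): Verdict {
  // Any anomaly after the deploy cutoff means the fix is incomplete.
  if (anomalyCount > 0) return "not_fixed";
  // Zero anomalies with zero traffic proves nothing: the bug had no chance to fire.
  if (flowCount === 0) return "no_traffic";
  return "verified";
}

console.log(verdict(0, 12)); // "verified"
console.log(verdict(0, 0));  // "no_traffic"
console.log(verdict(3, 12)); // "not_fixed"
```

The middle case is the one a single query hides. In that case the sensible move is to push the verify-by date out instead of closing the entry.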

Why structured beats “I’ll remember”

Three weeks later, the original investigation is gone from my head. The PostHog dashboard looks slightly different – different time range, different aggregation, slightly different filters. I cannot re-derive what I was looking at on the day I shipped the fix. The entry is the only artifact that survives. It does not have to be smart; it has to be specific enough that future-me, opening the file cold, can run the verification query and know which answer means “fixed”.

This sounds obvious. It is. Yet it is easy to skip, because the cost of writing the entry is paid the day you ship – when you are tired and the change works on your machine and you want to move on – and the value only arrives weeks later. You do not notice the value until the first time you open an entry from a month ago and save yourself an hour.

What it isn’t

This is not a test suite. It does not run on every commit. It does not prevent the bug from coming back; it just catches it if it does.

It is not alerting. Nothing pages me when an entry’s verify-by date passes. The discipline is mine: every Monday I look at the file and run anything that’s due.
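The Monday pass can be made a little less manual without turning it into alerting. The sketch below lists entries that are still pending and whose verify-by date has passed. It assumes the entry layout from the template and, as an additional assumption, that dates are written as YYYY-MM-DD; dueEntries is a hypothetical helper, not something in the repo.

```typescript
interface DueEntry {
  title: string;
  verifyBy: string;
}

// `today` is an ISO date string such as "2024-01-10".
function dueEntries(markdown: string, today: string): DueEntry[] {
  const due: DueEntry[] = [];
  // Each entry starts with a "## " heading; the text before the first heading is dropped.
  for (const section of markdown.split(/^## /m).slice(1)) {
    const title = (section.split("\n")[0] ?? "").trim();
    const date = /^- Verify by:\s*(\d{4}-\d{2}-\d{2})/m.exec(section)?.[1];
    const pending = /^- Resolution:\s*pending/m.test(section);
    // ISO dates compare correctly as plain strings.
    if (date !== undefined && pending && date <= today) {
      due.push({ title, verifyBy: date });
    }
  }
  return due;
}
```

Entries with a placeholder date or a non-pending resolution are skipped. The function only lists what is due; running the queries and judging the result stays a human job.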

It is not a runbook. The entry assumes I wrote it; another engineer reading it cold would be missing context. For a team of one, that is fine. For a team of more, the entry would need to be slightly longer.

It is paperwork. Ten minutes per fix. The win is the small but consistent number of times I open an entry, run the query, and find that what I shipped did not actually move the metric. Without the file, I would have called those fixes done.

Tag: technology
