Closing the loop between an anomaly and a verified fix
It has been a few years since the last post. The reason is the most ordinary one I could give: a small human moved into the house, evenings stopped being my own, and writing took a back seat. The nights are getting easier now, and a little of the old energy is back.
I have been working on a small side project – an online tabletop platform for an old Hungarian card game called M.A.G.U.S. A few lessons have come out of it that I want to write down before they fade. This is the first: the gap between “I shipped a fix” and “I know the fix worked”.
Sentry tells you that something crashed. PostHog does not. PostHog tells you that something happened, and lets you look at the pattern – how often, in what context, and to whom. So when a rare-by-design guard suddenly fires 171 times across 13 users, you notice. Then you fix the cause. Then comes the part nobody talks about: how do you actually verify the fix? Sentry closes that loop for you. PostHog doesn’t. This is the file I keep to close it manually. I call it FOLLOWUPS.md.
The two tools, drawn cleanly
A quick line on the split, because the rest of this post leans on it. Sentry is an error-monitoring product: it catches crashes – uncaught exceptions, broken promises, things the user sees as “something went wrong”. When you ship a fix, the resolved error stays resolved; if it comes back, Sentry tells you. PostHog is a product-analytics tool (similar in role to Mixpanel or Amplitude): it tracks behaviour – the events you chose to instrument, the patterns you chose to watch. A signal in PostHog is almost always a rate: this thing should be rare, it is happening more than rare, what changed? Fixing the cause is half the work. Going back days later to confirm the rate actually dropped is the other half, and PostHog will not raise its hand for you.
A concrete example
The same project has a netaction_anomaly event – a tracker I added to log any time the receiver’s authority check on an incoming action fails. The game is peer-to-peer, so each client validates the actions it receives from peers; if a guest sends an action that should only come from the host, the check trips and the event fires. It is the kind of guard I expect to fire once a year, on a real attack or a real bug.
I wired it as a PostHog event rather than a Sentry warning. Both would work – Sentry can capture warnings without crashes – but PostHog gives me the rate and shape of any anomaly across users, not just an issue page.
PostHog showed me it had fired 171 times across 13 distinct users over the past six weeks. The problem was not that the guard fired – a single fire on a real attack or real bug would have been normal. The problem was the volume: a once-a-year guard does not fire 171 times in six weeks. The breakdown was clean: same action type, same reason: "host_only", same sender_role: "spectator". Every guest joining a game was tripping the guard. Nothing was broken from the user’s point of view – the game played fine – but the anomaly stream was now drowning in false positives, and any real anomaly would be lost in the noise.
The diagnosis took an evening. The host broadcasts a “participants updated” message to everyone in the room (including the brand-new guest who just joined), and then immediately starts a per-peer sync to bring that new guest up to speed. The order was wrong: when the guest received the broadcast, its local model of who is in this room was still empty, so the sender (the host) looked like a spectator to the receiver’s authority check. The guard fired. The payload was applied anyway – I run guards in track-only mode, but that is a different post – so gameplay was unaffected. The data stream was the only thing that knew.
The fix was a one-line swap: do the per-peer sync first, then broadcast. PeerJS preserves per-connection message order, so the joining guest now receives sync → broadcast in the correct sequence, and the guard does not fire.
I committed the fix. And then I almost moved on.
The entry
Instead, I wrote one paragraph in a file called FOLLOWUPS.md. It looked like this (redacted):
## 1 - netaction_anomaly (PARTICIPANT_UPDATE / host_only / sender=spectator)- Found via: PostHog query on <date>. 171 events across 13 users since the project started tracking. Fires on essentially every game session with a guest joining.- Hypothesis: host-side join-message ordering race. <details>- Fix applied: swapped call order in PeerService at <file:line>. Commit <hash>.- Fix date: <date>- Deploy date: <date> (cutoff for verification, set conservatively to land after build + CDN propagation)- Verify by: <date + 7 days>- Verification query: SELECT count(), count(DISTINCT person_id) FROM events WHERE timestamp >= <deploy> AND event = 'netaction_anomaly' AND properties.action_type = 'PARTICIPANT_UPDATE' AND properties.reason = 'host_only' AND properties.sender_role = 'spectator';- Expected outcome: 0 anomalies AND at least one fresh guest join in the window. If anomalies > 0, the ordering race is not the sole source.- Resolution: pending
The same template, as a standalone file, is in the repo for this post.
The entry takes about ten minutes to write. The cost is small. The value is that on the verify-by date – which is on my calendar – I run the query and I know.
The sanity check – the part nobody tells you
Look at the “expected outcome” line. It has two clauses: zero anomalies AND at least one fresh guest join in the window. The second clause is the easy one to leave out.
If I run the query a week later and see zero anomalies, I might be looking at a working fix. Or I might be looking at a week where nobody used the affected feature. The data does not distinguish between “the fix worked” and “the bug had no opportunity to fire”. Without a paired traffic query, “zero” is meaningless.
Every entry in the file now has the sanity check baked in. If the success metric is “X dropped to zero”, there is always a second query asking “did the flow that produces X actually happen?” If both queries answer yes, the fix is verified. If only one does, I have a problem and a clue.
Why structured beats “I’ll remember”
Three weeks later, the original investigation is gone from my head. The PostHog dashboard looks slightly different – different time range, different aggregation, slightly different filters. I cannot re-derive what I was looking at on the day I shipped the fix. The entry is the only artifact that survives. It does not have to be smart; it has to be specific enough that future-me, opening the file cold, can run the verification query and know which answer means “fixed”.
This sounds obvious. It is. Yet it is easy to skip, because the cost of writing the entry is paid the day you ship – when you are tired and the change works on your machine and you want to move on – and the value is paid weeks later. You do not notice the value until the first time you open an entry from a month ago and save yourself an hour.
What it isn’t
This is not a test suite. It does not run on every commit. It does not prevent the bug from coming back; it just catches it if it does.
It is not alerting. Nothing pages me when an entry’s verify-by date passes. The discipline is mine: every Monday I look at the file and run anything that’s due.
It is not a runbook. The entry assumes I wrote it; another engineer reading it cold would be missing context. For a team of one, that is fine. For a team of more, the entry would need to be slightly longer.
It is paperwork. Ten minutes per fix. The win is the small but consistent number of times I open an entry, run the query, and find that what I shipped did not actually move the metric. Without the file, I would have called those fixes done.