The 3am packet that bit every other Tuesday.
For eleven weeks our payments cluster lost 47 minutes of useful throughput on a fortnightly cadence. Pings were fine. TLS handshakes were fine. Then, around 03:04 UTC, the first SYN of every fresh flow began arriving like a postcard sent the long way. This is the log of how we found it — and why the answer was a stale ARP entry on a partner ISP's distribution switch in Frankfurt.
The first hint, May 5 — a customer ticket that should have been a ping spike.
The ticket came in at 09:12 local from a customer support escalation, not from our pager. That was the first clue I should have weighed more carefully — Prometheus had not lit a single rule. The merchant had complained that batch settlement files from his POS gateway timed out "around three in the morning, but only sometimes." Sometimes is the worst word in network engineering. Sometimes means the failure has a phase the dashboards can't see.
I pulled the customer's flows from our Grafana vpc-east-flows panel and the picture looked entirely banal: 99.6% success rate over thirty days, p99 RTT at 4.8 ms, no retransmit storms. The connection between his POS and our payments cluster crossed the private peering we operate with NorthRing Telecom, a regional ISP whose Frankfurt PoP sits one cage over from ours. Two physical fibers, LACP'd at 10G, eBGP for the public side, a static L2 peering for the private side. Stable since 2023.
The day before, May 4, had been a Monday. The customer's batch had run fine. So had every Monday batch. The Sunday batch had been fine. The Saturday batch had been fine. I scrolled the calendar back and noticed, with the slowed-down attention you get only on the third coffee, that the timeout events clustered: April 21, April 7, March 24, March 10. Tuesdays. Every other Tuesday. Eleven weeks of them, hidden by averages, hidden by the fact each event was small enough to look like a generic timeout.
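The cadence is easy to confirm mechanically once the dates are written down. What follows is a minimal sketch, assuming GNU `date` and assuming the year is 2026 (the year is inferred from the mitigation deadline given later in this log; it is not stated next to these dates). The helper names `weekday` and `gap_days` are illustrative only.

```shell
#!/bin/sh
# Assumes GNU date (coreutils). The year 2026 is an assumption.

# weekday DATE: English weekday name of an ISO date, evaluated in UTC.
weekday() { LC_ALL=C date -u -d "$1" +%A; }

# gap_days FROM TO: whole days between two ISO dates (UTC, so no DST skew).
gap_days() {
  from=$(date -u -d "$1" +%s)
  to=$(date -u -d "$2" +%s)
  echo $(( (to - from) / 86400 ))
}

# Walk the timeout dates in order and print weekday plus gap to the previous one.
prev=""
for d in 2026-03-10 2026-03-24 2026-04-07 2026-04-21; do
  if [ -n "$prev" ]; then gap=$(gap_days "$prev" "$d"); else gap="-"; fi
  printf '%s %s gap=%s\n' "$d" "$(weekday "$d")" "$gap"
  prev=$d
done
```

Under the 2026 assumption every date lands on a Tuesday and every gap is 14 days, which is the pattern the thirty-day averages were hiding.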
I ran a quick fping -l -p 200 10.42.7.5 against the customer's gateway from inside our cluster. Eleven thousand replies, zero loss, mean 4.81 ms, max 5.94 ms. The peering was not down. It was, on a particular schedule, getting confused for the first 0.8–1.4 seconds of the first new flow of the morning, and only the first one. The rest of the batch then queued behind a stack of retried SYNs and the gateway's idempotency window slammed shut at 60 seconds. By the time anyone noticed, the link was again indistinguishable from healthy.
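To get a feel for how quickly retried SYNs eat into a 60-second window, here is a small sketch of the arithmetic. It assumes an initial retransmission timeout of 1 s that doubles after each retry; the 1.0, 2.0, 4.0 s spacing seen in the capture is consistent with that, but the stack's real settings are an assumption. `syn_retry_elapsed` is a hypothetical helper, not something from our tooling.

```shell
#!/bin/sh
# syn_retry_elapsed RETRIES [INITIAL_RTO]
# Seconds elapsed when the Nth retried SYN is sent, assuming the timeout
# starts at INITIAL_RTO (default 1 s) and doubles after every retry.
syn_retry_elapsed() {
  retries=$1
  rto=${2:-1}
  elapsed=0
  i=0
  while [ "$i" -lt "$retries" ]; do
    elapsed=$((elapsed + rto))
    rto=$((rto * 2))
    i=$((i + 1))
  done
  echo "$elapsed"
}

for n in 1 2 3 4 5; do
  printf 'retry %s sent at t+%ss\n' "$n" "$(syn_retry_elapsed "$n")"
done
```

Under these assumptions the third retry leaves 7 s after the original SYN and the fifth at 31 s, so one stalled flow can burn half of a 60-second window before any payload moves.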
This is the kind of bug I prefer to lose to. It isn't pretending to be a bigger problem. It isn't waving its hands. It does its damage, hides, and lets you blame the application. I closed my laptop and made a note in the incident log: see again on Tuesday May 19, 02:50 UTC, with a capture running.
Reading tcpdump like a diary, not a search query.
On Tuesday May 19 I came in at 02:30 UTC with a thermos and a Cat6 console cable. The capture I had planned was deliberately wide — broader than I usually allow myself, because I'd rather scroll a gigabyte of pcap than re-run a missed event. I started two tcpdumps, one on each side of the peering.
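```shell
# on the cluster side, 10.40.0.0/16 internal
# 30-min rotating files; tcpdump exits after 4 files (2h of capture).
# -G needs a strftime pattern in the -w name, otherwise each rotation
# overwrites the previous file.
sudo tcpdump -i ens5 -s 0 -G 1800 -W 4 -Z root \
  -w '/var/cap/fra-05-19-%H%M.pcap' \
  'host 10.42.7.5 or net 192.0.2.0/24 or arp'
```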
```shell
# on the gateway facing the NorthRing peering, ens6 is the 10G LACP
sudo tcpdump -i ens6 -nn -s 96 -tt -w /var/cap/peer-05-19.pcap \
  '(vlan 412) and (tcp port 4443 or arp or icmp)'
```
The capture filter on ens6 matters more than you might think. NorthRing tags the private peering on VLAN 412; without the vlan 412 predicate, BPF on Linux misses the dot1q-tagged ARP entirely, because the kernel has already de-tagged for the rest of the stack. This is the kind of thing you learn once, painfully, and never forget. I had learned it once, painfully, in 2019, in a different city, debugging a different problem.
At 02:58:41 UTC the first sign appeared. A single ARP request from 192.0.2.1 asking for 192.0.2.5 — our side of the peering — broadcast on VLAN 412. Nothing unusual yet; ARP requests happen. What was unusual was that we had answered the same request from the same MAC fourteen times in the previous hour. Our gateway always replied within 40 microseconds. NorthRing's distribution switch, for some reason, was asking again every four to six minutes.
Then, at 03:04:09 UTC, two ARP replies arrived almost on top of each other for the same address. One came from our gateway's ec:0d:9a:73:1c:bf. The other came from 00:50:56:91:a0:42 — a VMware OUI. We do not run VMware on this network. We have not run VMware on this network since 2022. There it was, in the trace, calmly replying as if it owned the address. The L2 was forked.
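Spotting the vendor prefix by eye works at 3am, but it is easy to script. A minimal sketch, checking only the single 00:50:56 prefix mentioned above; it is not a full vendor lookup, and the function names are made up for illustration.

```shell
#!/bin/sh
# oui_of MAC: print the first three octets (the OUI), lowercased.
oui_of() {
  printf '%s\n' "$1" | tr 'A-Z' 'a-z' | cut -d: -f1-3
}

# is_vmware_oui MAC: succeed if the MAC falls under 00:50:56.
# Only this one prefix is checked; it is not a complete vendor list.
is_vmware_oui() {
  [ "$(oui_of "$1")" = "00:50:56" ]
}

# The two MACs that answered for 192.0.2.5 in the capture.
for mac in ec:0d:9a:73:1c:bf 00:50:56:91:a0:42; do
  if is_vmware_oui "$mac"; then
    echo "$mac VMware-range OUI"
  else
    echo "$mac other OUI"
  fi
done
```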
The Wireshark filters that finally bit — and the ones that wasted my night.
Wireshark is a kind of dictionary you have to know how to read aloud. I spent the first two hours of May 20 going down filter rabbit holes that looked productive but weren't. Here, for the public record, are the ones that did not help, alongside the five that did.
| Filter expression | Role | Hits | Useful? |
|---|---|---|---|
| tcp.analysis.retransmission | noise sweep | 11,304 | No: overwhelmed by unrelated client churn. |
| tcp.flags.syn == 1 and tcp.flags.ack == 0 | flow opener | 2,981 | Partial: surfaced 03:04 SYNs but not their fate. |
| arp.duplicate-address-detected | L2 forensic | 2 | Yes. This is the hammer. |
| arp.opcode == 2 and not arp.src.hw_mac == ec:0d:9a:73:1c:bf | L2 forensic | 7 | Yes: exposed the ghost VMware MAC. |
| tcp.stream eq 4471 && tcp.flags.syn | flow trace | 3 | Yes: three retried SYNs spaced 1.0, 2.0, 4.0 s. |
| icmp.type == 3 and icmp.code == 1 | unreachable | 0 | No: no host-unreachable from upstream. |
| frame.time_delta_displayed > 0.5 | timing | 38 | Yes: clustered around 03:04:09–03:04:21. |
| vlan.id == 412 && arp | scoped L2 | 416 | Yes: refined the haystack to one VLAN. |
The duplicate-address-detected expander told the story in one window. Wireshark had been quietly noticing, on every Tuesday capture, that 192.0.2.5 was being answered by two MACs within a few seconds of each other. We had two captures from previous Tuesdays — March 24 and April 7 — that I had filed and not opened. Both, when re-examined, contained the same two replies. The ghost MAC 00:50:56:91:a0:42 appeared, did its damage, and disappeared within a window of forty to sixty seconds. The distribution switch downstream, after this brief seizure, settled on our MAC again and the world looked normal.
```shell
tshark -r peer-05-19.pcap \
  -Y 'arp.opcode==2 and arp.src.proto_ipv4==192.0.2.5' \
  -T fields -e frame.time -e arp.src.hw_mac -e arp.src.proto_ipv4
# 03:04:09.412  ec:0d:9a:73:1c:bf  192.0.2.5  ← our gateway, expected
# 03:04:09.488  00:50:56:91:a0:42  192.0.2.5  ← who are you?
# 03:04:21.117  ec:0d:9a:73:1c:bf  192.0.2.5  ← gratuitous, ours
```
The cadence: why every other Tuesday, and why 03:04?
By Thursday morning I had a packet-level diagnosis but no why. The why is always the harder thing. A timeline helped: I sat down with the incidents from the previous eleven weeks and the partner's public maintenance window page and worked through them in chronological order.
First quiet incident — five SYN retries, no alarm.
Logged in retrospect. A single customer batch from Heidelberg Office Supply hit a 4-second handshake on its first flow; the rest of the night was clean. No ticket filed at the time.
Second incident — two customers, both timed out within 90 seconds.
Both customers retried successfully on the next attempt. Total user-visible degradation: roughly 110 seconds. No SEV opened.
Third — a quiet pattern hardens.
Identical timing. The merchant gateway operator noted in his own logs: "appears as if the route flipped for 8 seconds." He blamed his ISP. So would I have.
Customer ticket finally lands.
The Heidelberg merchant's batch fails three times consecutively before succeeding. His support team escalates to ours. Ticket misrouted to the application team for six days.
Two merchants drop in tandem — first real escalation.
The pattern is now undeniable. The fortnightly cadence, the 03:04 onset, the L2 fingerprint. I open the May 19 capture plan.
The capture night — duplicate ARP captured live.
Two MACs answering for the same address, 76 milliseconds apart. The ghost VMware OUI confirmed in three independent pcap segments. Cluster gateway issues its own gratuitous ARP twelve seconds later, at 03:04:21, and the L2 forest settles.
NorthRing concedes a maintenance script.
After the capture and a screen-share, NorthRing's NOC identifies a legacy Ansible playbook that, every other Tuesday at 03:04, flushes-and-rebuilds the ARP table on NR-DIST-2 in Frankfurt. The flush is fine. The rebuild reads from a stale inventory file.
Once you see it, you cannot unsee it. The fortnightly cadence wasn't mysterious — it was a cron job. The 03:04 onset wasn't poetic — it was the third minute of a four-minute maintenance script. The ghost MAC was a literal ghost: an Ansible inventory entry from a server that had been decommissioned thirty-eight months earlier but never removed from a group_vars/static_arp.yml file. It returned, faithfully, every fourteen days, like a sleepwalker.
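Classic cron has no native "every other week" field, so fortnightly jobs are often scheduled weekly and gated on a parity check. How NorthRing actually scheduled theirs is not something I know; the sketch below is one common idiom, shown only because it also makes the next occurrence predictable. It assumes GNU `date` and the year 2026, and `week_parity` is a hypothetical helper.

```shell
#!/bin/sh
# Assumes GNU date. week_parity DATE prints 0 or 1: the parity of the number
# of whole 7-day periods since the Unix epoch at 03:04 UTC on that date.
# Consecutive Tuesdays alternate between 0 and 1.
week_parity() {
  secs=$(date -u -d "$1 03:04:00" +%s)
  echo $(( secs / 604800 % 2 ))
}

# A weekly job gated this way runs only on alternate Tuesdays.
for d in 2026-05-12 2026-05-19 2026-05-26 2026-06-02; do
  if [ "$(week_parity "$d")" -eq 1 ]; then
    echo "$d run"
  else
    echo "$d skip"
  fi
done
```

With these assumptions May 19 and June 2 fall on the "run" side and the Tuesdays between them do not, matching the dates in this log.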
An ARP entry that refused to die — and what it taught us about state.
An ARP entry isn't supposed to have a life of its own. It is supposed to be transient, lazy, and replaced by whatever the wire most recently said. The whole point of the protocol is that the network is what it appears to be at any given moment, and the moment is short. We forget how fragile that contract is once a static override enters the picture.
NorthRing's playbook did not, as I'd first feared, push a broken configuration into running config. The damage was subtler. After the flush, the playbook walked the static_arp.yml manifest and inserted each entry into the running ARP table with a 240-second timeout. For 240 seconds, the distribution switch believed our peering address belonged to a server that had not existed for thirty-eight months. Any frame destined for 192.0.2.5 during that window was forwarded toward a MAC that no port on the switch had seen in years — and the switch, ever obliging, flooded those frames out every port in VLAN 412 looking for a home.
That flooding wasn't free. It pushed the per-port input queue on NorthRing's downstream distribution port up to about 38% of its 1G capacity for the duration of the flood. Most of the time, the queue absorbed it. Sometimes — and here is where the "every other Tuesday" pattern took its specific shape — a coincident burst of legitimate batch traffic on the same VLAN tipped the queue over its early-drop threshold, and SYNs were dropped silently. Not RST'd. Not unreachable'd. Dropped, like postcards into a drawer.
The cure was both trivial and embarrassing. NorthRing rewrote the playbook to reconcile the static-ARP manifest as a diff against the live ARP table, rather than as a blind replace. They also, finally, removed the eleven ghost entries. We watched the next Tuesday (June 2, 03:04 UTC) together over a video call. The capture showed exactly one ARP request, exactly one reply from us, exactly zero ghost MACs. The fortnight after that, the same. The fortnight after that, the same.
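I have not seen NorthRing's rewritten playbook, so what follows is only a sketch of the idea of diffing a manifest against live state before touching anything. It assumes both sides can be exported as flat "ip mac" lines, a hypothetical format (the real static_arp.yml layout is not shown in this log), and the second address pair in the sample data is invented for illustration.

```shell
#!/bin/sh
# stale_entries MANIFEST LIVE
# Both files hold "ip mac" pairs, one per line (hypothetical flat format).
# Prints manifest entries whose IP is missing from the live table or is
# currently answered by a different MAC, so they can be reviewed, not pushed.
stale_entries() {
  awk '
    FILENAME == ARGV[1] { live[$1] = tolower($2); next }   # first file: live table
    NF >= 2 {
      mac = tolower($2)
      if (!($1 in live) || live[$1] != mac) print $1, mac
    }
  ' "$2" "$1"
}

# Sample data: the ghost entry from this incident plus one made-up healthy entry.
tmp=$(mktemp -d)
printf '%s\n' '192.0.2.5 00:50:56:91:a0:42' '192.0.2.1 AA:BB:CC:00:11:22' > "$tmp/manifest"
printf '%s\n' '192.0.2.5 ec:0d:9a:73:1c:bf' '192.0.2.1 aa:bb:cc:00:11:22' > "$tmp/live"
stale_entries "$tmp/manifest" "$tmp/live"
rm -rf "$tmp"
```

The design point is that disagreement between manifest and wire becomes a report for a human rather than an automatic write. An entry that is merely quiet will also show up here, which is acceptable for a review list and unacceptable for a blind replace.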
If this story has a moral, it is the one every network engineer eventually carves into the inside of their forehead: infrastructure that survives its owner is infrastructure waiting to bite. The VMware host died in 2022. Its IP address survived in someone's YAML, and its MAC address survived in someone's arp -s, and on a fortnightly basis it remembered itself.
Talking to the partner ISP — what we asked, what we got, what we paid for in coffee.
A polite, evidence-laden e-mail to NorthRing's NOC at 03:00 UTC on Sunday May 24 was the highest-leverage action of this entire investigation. I attached three pcap excerpts, a timeline of incidents, and the exact filter that surfaced the ghost ARP. I did not speculate about cause. I described, in flat declarative prose, what I had observed.
What I sent at 03:00 on Sunday.
One paragraph of context. Three pcap files, each under 4 MB, named for the date they were taken. One screenshot of the duplicate-address-detected expander. A single question: "Is there scheduled work on VLAN 412 fortnightly at 03:00 UTC?"
What they showed me on Tuesday afternoon.
Their senior NOC engineer, Bartłomiej K., opened the Ansible playbook in front of me. Line 47 was a loop over static_arp.yml. He'd inherited the playbook from a colleague who'd left in late 2023. Nobody had touched static_arp.yml since.
What we agreed to do, and by when.
Within 24 hours: ghost entries removed. Within seven days: playbook rewritten to diff. Within thirty days: shared dashboard for VLAN 412 anomalies. We met all three deadlines — they on day 2, day 4, day 21; we on day 9.
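```shell
#!/usr/bin/env bash
# Run on a payments gateway, watches for ghost MAC reappearance.
# Lives in /usr/local/sbin/arp-ghost-watch; runs from a systemd timer.
# Expects ALERT_URL in the environment.
target="192.0.2.5"
expected_mac="ec:0d:9a:73:1c:bf"
ts="$(date -u +%FT%TZ)"
# iputils arping prints "Unicast reply from IP [MAC] time": strip the
# brackets and lowercase the MAC so the comparison can succeed. No -f, so
# every MAC that answers within the three probes is collected.
seen="$(arping -c 3 -I ens6 "$target" \
  | awk '/reply/ {gsub(/\[|\]/, "", $5); print tolower($5)}' \
  | sort -u | paste -sd, -)"
# Alert on no reply, a different MAC, or more than one MAC.
if [[ "$seen" != "$expected_mac" ]]; then
  logger -t arp-ghost-watch "GHOST $ts saw=$seen expected=$expected_mac"
  curl -sS -X POST "$ALERT_URL" -d "ghost-arp $ts $seen"
fi
```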
The watcher has fired exactly zero true positives since June 2. It has fired twice during planned maintenance on our side and both alerts were correctly suppressed by the silencing label. I don't trust the silence. I trust the watcher. I will keep it running for at least a year.
Lessons, runbooks, and the things we changed in the dark.
A postmortem isn't worth filing if it doesn't change the rooms it touches. Here are the seven things this incident changed, ordered by usefulness rather than chronology.
| Change | Role | Cost (h) | Owner |
|---|---|---|---|
| Add duplicate-address-detected as a Prom alert via a sidecar tshark. | detection | 6.0 | Maren A. |
| Quarterly review of static-ARP manifests on both sides of all peerings. | hygiene | 2.5 / q | NetOps |
| Move every "every other Tuesday" partner maintenance into shared calendar with annotations on our Grafana. | visibility | 3.5 | SRE liaison |
| Add gratuitous-ARP from our gateway every 60 s during 03:00–04:00 UTC on Tuesdays (temporary, until June 30 2026). | mitigation | 1.5 | Maren A. |
| Customer-facing status page now distinguishes peering events from backbone events. | comms | 4.0 | Status WG |
| Document this incident in the on-call runbook with the exact Wireshark filter that bit. | runbook | 2.0 | Maren A. |
| Push for a formal L2 health metric in the next NorthRing peering contract renewal. | commercial | 12.0 | Procurement |
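The temporary gratuitous-ARP mitigation in the table can be sketched as a once-a-minute tick that does nothing outside the window. This is an illustration rather than our deployed unit: it assumes GNU `date` and iputils `arping`, reuses the interface and address from earlier in this log, and the function names are hypothetical.

```shell
#!/bin/sh
# Assumes GNU date and iputils arping. Interface and address are the ones
# used elsewhere in this log; the functions themselves are illustrative.

# in_mitigation_window [EPOCH]: succeed on Tuesdays from 03:00 to 03:59 UTC.
in_mitigation_window() {
  now=${1:-$(date -u +%s)}
  dow=$(date -u -d "@$now" +%u)    # 1 = Monday ... 7 = Sunday
  hour=$(date -u -d "@$now" +%H)
  [ "$dow" -eq 2 ] && [ "$hour" = "03" ]
}

# announce_once: send one unsolicited (gratuitous) ARP for our peering address.
announce_once() {
  arping -U -I ens6 -c 1 192.0.2.5
}

# mitigation_tick [EPOCH]: meant to be called once a minute by a timer.
mitigation_tick() {
  if in_mitigation_window "$@"; then
    announce_once
  fi
}
```

Keeping the window check separate from the announcement makes the schedule testable without sending a single frame, and makes the June 30 2026 removal a one-line change in the timer.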
The thing I keep returning to, the thing I want every junior engineer at our cage to take from this story, is the cadence of the work itself. I did not discover the ghost ARP. I let the network discover it for me, by waiting until 02:30 UTC on a Tuesday with a thermos and two tcpdumps. The patient hours of capture were the entire investigation. Wireshark filters are tools, not searches; they help you read a thing you already have rather than fetch a thing you don't.
Frequently asked, often badly.
Why didn't you see this in Prometheus from day one?
duplicate_address_detected_total per VLAN per minute, alerting on any non-zero value.
Couldn't you have just pinned the MAC with arp -s on your side?
Why didn't the gratuitous ARP from your gateway clear the cache faster?
SYN-RST retransmit at 1.0 s.