
The Case of the Stale Snapshot: Tracking a Logical Race in Real-World Systems
How I tracked down a ‘Phantom Error’ in a high-performance authorization layer that defied logic, race detectors, and logs.

During a new feature rollout of Skipper (an open-source ingress controller from Zalando), we hit a puzzling issue: fewer than 5% of requests to a specific route failed consistently in one pod, while the same configuration worked perfectly everywhere else. The culprit? A timing-dependent bug in how OPA’s plugin manager handles state notifications, one that Go’s race detector won’t catch. (If you prefer a more visual walkthrough, refer to the slide deck at https://pushpalanka.com/posts/stale-snapshot-case. It was prepared as an industry example for students of concurrency and operating systems.) ...
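To make "a bug the race detector won't catch" concrete, here is a minimal sketch of a logical race. It is not OPA's or Skipper's actual code: the `manager`, `listener`, and `staleSnapshotDemo` names and the status values are hypothetical, and the channels only force one unlucky interleaving so the outcome is reproducible. Every shared field is accessed under a mutex, so there is no data race to report, yet the listener still ends up holding a stale snapshot.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a hypothetical plugin state used only for this illustration.
type Status string

// manager guards its status with a mutex, so reads and writes never race.
type manager struct {
	mu     sync.Mutex
	status Status
}

func (m *manager) set(s Status) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.status = s
}

// snapshot returns a copy of the status taken under the lock.
func (m *manager) snapshot() Status {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.status
}

// listener remembers the last status it was notified about.
type listener struct {
	mu   sync.Mutex
	seen Status
}

func (l *listener) notify(s Status) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.seen = s
}

func (l *listener) last() Status {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.seen
}

// staleSnapshotDemo replays one interleaving: updater A takes its snapshot,
// is delayed, and delivers it only after updater B has already notified.
func staleSnapshotDemo() (actual, seen Status) {
	m, l := &manager{}, &listener{}
	aSnapshotted := make(chan struct{})
	bNotified := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // updater A
		defer wg.Done()
		m.set("NOT_READY")
		snap := m.snapshot() // correct at the moment it is taken
		close(aSnapshotted)
		<-bNotified    // stands in for A being descheduled here
		l.notify(snap) // delivered late: overwrites the newer "OK"
	}()
	go func() { // updater B
		defer wg.Done()
		<-aSnapshotted
		m.set("OK")
		l.notify(m.snapshot())
		close(bNotified)
	}()
	wg.Wait()
	return m.snapshot(), l.last()
}

func main() {
	actual, seen := staleSnapshotDemo()
	fmt.Printf("actual=%s listener=%s\n", actual, seen) // actual=OK listener=NOT_READY
}
```

Running this with `go run -race` should report nothing, because each individual access is synchronized. The bug lives in the gap between taking a snapshot and delivering it, which is a matter of ordering rather than of unsynchronized memory access.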

Background

This learning comes from a project aimed at providing Externalized Authorization as a Service (AaaS), integrated directly into the platform. The solution uses Open Policy Agent (OPA) as the Policy Decision Point (PDP), with policy enforcement handled by Skipper, an open-source ingress controller and reverse proxy. Skipper integrates with OPA to serve as the Policy Enforcement Point (PEP). For a detailed overview, refer to this Zalando Engineering blog post.

As an ingress controller, Skipper is designed to add minimal overhead to requests. Given the large number of deployed instances, any inefficiency in resource allocation can quickly scale into significant costs. Similarly, a delay of just a few milliseconds per request becomes expensive when multiplied across the thousands of requests flowing through Skipper. Now, should we save the goat or the cabbages? ...