
Beyond Race Detectors: First-hand Experience Debugging a Multi-threaded Stale Data Issue
During a new feature rollout of Skipper (an open-source ingress controller from Zalando), we hit a puzzling issue: fewer than 5% of requests to a specific route failed consistently in one pod, while the same configuration worked perfectly everywhere else. The culprit? A timing-dependent bug in how OPA's plugin manager handles state notifications, one that Go's race detector won't catch. (If you prefer a more visual representation, refer to the slide deck shared at https://pushpalanka.com/posts/stale-snapshot-case. It was prepared as an industry example for students of concurrency and operating systems.) ...
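To make the "race detector won't catch it" point concrete, here is a minimal sketch of the general class of bug. It is not the actual Skipper or OPA code. The `cache` type and the `deliverOutOfOrder` function are hypothetical names invented for illustration. Every access to the shared field is guarded by a mutex, so there is no data race for the detector to report, yet the listener still ends up holding a stale value because an older snapshot is applied after a newer one. A channel forces the unlucky ordering here so the outcome is reproducible; in a real system the ordering would depend on timing.

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a hypothetical listener-side copy of some upstream state.
// All reads and writes go through the mutex, so there is no data race.
type cache struct {
	mu    sync.Mutex
	ready bool
}

func (c *cache) set(v bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ready = v
}

func (c *cache) get() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.ready
}

// deliverOutOfOrder simulates two notifications, each carrying a snapshot
// taken at a different moment, that are applied in the opposite order.
func deliverOutOfOrder() bool {
	c := &cache{}
	newerApplied := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(2)

	// Older notification: its snapshot (ready=false) was taken first,
	// but it is applied last.
	go func() {
		defer wg.Done()
		snapshot := false
		<-newerApplied // forces the unlucky ordering for this demo
		c.set(snapshot)
	}()

	// Newer notification: its snapshot (ready=true) is applied first.
	go func() {
		defer wg.Done()
		c.set(true)
		close(newerApplied)
	}()

	wg.Wait()
	return c.get()
}

func main() {
	// The upstream state is "ready", but the cache says otherwise.
	fmt.Println("cached ready:", deliverOutOfOrder()) // prints "cached ready: false"
}
```

Running this with `go run -race` should report nothing, because the race detector looks for unsynchronized memory accesses and not for logically stale values. That is why a bug of this shape has to be found by reasoning about ordering and not by tooling alone.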