It started as a simple optimization task: make Skipper start faster. It ended as a detective story involving a race condition that was “impossible.”

Below is the full visual case study I presented to Concurrency and Operating Systems students. The full, detailed write-up can be found at Beyond Race Detectors.



Presentation Title: The Case of the Stale Snapshot. Debugging a multithreaded mystery where the system lies to itself. By Pushpalanka Jayawardhana.

The Mission: Optimizing Zalando’s Skipper Ingress Controller for faster startup using parallel OPA instance loading.

The Phantom Error: A ‘Locked Room Mystery’ in Kubernetes where a pod reports an error rate below 5% with zero logs.

Ruling out suspects: Why this was not a buggy OPA policy or corrupted bundle download.

Reproduction Strategy: Forcing the bug out of hiding by increasing the scale of OPA bundles in the local setup.

The Impossible Transition: Logs revealing the Bundle Plugin flipping from OK back to NOT_READY without a valid code path.

The Smoking Gun: Logs proving the listener received a ‘Stale Snapshot’ message from the past, while the source of truth was already Healthy.

Digital Timeline Analysis: A visual diagram of the Race Condition. Routine A is preempted by the OS scheduler, waking up later to deliver stale data.
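The timeline above can be replayed deterministically in a few lines of Go. This is a minimal sketch, not Skipper's or OPA's actual code: `Manager`, `Status`, and `simulateStaleSnapshot` are hypothetical names, and the `resume` channel merely stands in for the OS scheduler pausing Routine A. Every access is synchronized, so there is no data race here, only messages arriving in the wrong order.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a hypothetical plugin state used only for this illustration.
type Status string

const (
	NotReady Status = "NOT_READY"
	OK       Status = "OK"
)

// Manager is a stand-in for the source of truth (not a real Skipper/OPA type).
type Manager struct {
	mu     sync.Mutex
	status Status
}

func (m *Manager) Set(s Status) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.status = s
}

func (m *Manager) Get() Status {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.status
}

// simulateStaleSnapshot returns what the listener believes and what the manager holds.
func simulateStaleSnapshot() (Status, Status) {
	m := &Manager{status: NotReady}
	updates := make(chan Status, 2) // notification channel to the listener
	snapshotTaken := make(chan struct{})
	resume := make(chan struct{})
	done := make(chan struct{})

	// Routine A: copies the state, then is "preempted" before delivering it.
	go func() {
		defer close(done)
		snap := m.Get() // snapshot taken while the state is still NOT_READY
		close(snapshotTaken)
		<-resume        // stands in for the scheduler pausing this routine
		updates <- snap // delivered late: a message from the past
	}()

	<-snapshotTaken
	// Routine B (inlined): the state becomes healthy and is announced first.
	m.Set(OK)
	updates <- OK

	close(resume) // Routine A wakes up and delivers its stale snapshot
	<-done
	close(updates)

	// Listener: trusts whichever message arrives last.
	listenerView := NotReady
	for s := range updates {
		listenerView = s
	}
	return listenerView, m.Get()
}

func main() {
	listener, truth := simulateStaleSnapshot()
	fmt.Printf("listener=%s truth=%s\n", listener, truth) // listener=NOT_READY truth=OK
}
```

The listener ends up believing NOT_READY while the manager is already OK, which is the "impossible" transition seen in the logs.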

The code snippet that allowed the stale snapshot through.

The Fix: Changing the code to bypass the notification channel and query the OPA Manager source of truth directly.
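The idea behind the fix can be sketched as follows. This is an illustrative example under assumptions, not the change made in the Skipper PR: `Manager`, `Listener`, `OnNotify`, and `replayWithFix` are hypothetical names. The listener treats a notification only as a hint that something changed, ignores its payload, and reads the current state from the manager.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a hypothetical plugin state used only for this illustration.
type Status string

const (
	NotReady Status = "NOT_READY"
	OK       Status = "OK"
)

// Manager is a stand-in for the source of truth (not a real Skipper/OPA type).
type Manager struct {
	mu     sync.Mutex
	status Status
}

func (m *Manager) Set(s Status) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.status = s
}

func (m *Manager) Get() Status {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.status
}

// Listener keeps its own view of readiness, derived from the manager.
type Listener struct {
	source *Manager
	view   Status
}

// OnNotify ignores the (possibly stale) payload and re-reads the source of truth.
func (l *Listener) OnNotify(_ Status) {
	l.view = l.source.Get()
}

// replayWithFix feeds the listener the same out-of-order messages as in the race.
func replayWithFix() Status {
	m := &Manager{status: NotReady}
	l := &Listener{source: m, view: NotReady}

	m.Set(OK)
	// Fresh message first, stale snapshot last: the order that caused the bug.
	for _, msg := range []Status{OK, NotReady} {
		l.OnNotify(msg)
	}
	return l.view
}

func main() {
	fmt.Printf("listener=%s\n", replayWithFix()) // listener=OK
}
```

Because the payload is never trusted, a late-arriving snapshot can at worst trigger one redundant read; it can no longer overwrite a newer state.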

Lessons Learned: Why structured logs beat debuggers for concurrency, and why race detectors miss event ordering bugs.

Lessons Learned: Questioning assumptions and using AI as a productivity tool, not a debugger.

Beyond data races, to logic races. The hidden dangers of information propagation.

Case File Sources: Links to GitHub Issue #8009, Skipper PR #3562, and the original engineering blog posts.

Conclusion: With multi-threading, time is not linear.

Thank you slide.


Key Takeaways

  1. Logs > Debuggers: You cannot step through a race condition in an IDE.
  2. Time is not linear: In multi-threaded systems, pre-emption means “later” in code does not mean “later” in time.
  3. The Quick Fix: Bypass the notification channel and query the Source of Truth directly.

Disclaimer:

Organic content from the original blog: Beyond Race Detectors. Slide structure supported by NotebookLM and beautified by Nano Banana in Google Slides.