☕ Welcome to The Coder Cafe! These days, most posts about AI for production circle the same ideas: automated remediation, anomaly detection, alert triage, etc. These are interesting starting points, but they share a common assumption: that AI’s job is to replace what SREs do. In this post, I want to explore the idea of AI as a cognitive partner, something that extends what a single engineer can hold in their head at once. Get cozy, grab a coffee, and let’s begin!
At Google, I’m an SRE on the Google Distributed Cloud team, where the infrastructure stack spans Kubernetes, Borg, distributed storage, virtualization, networking, and more. Over the past months, I’ve been experimenting with ways AI can help, not only by automating work away, but by reducing the cognitive overhead that can make production work overwhelming.
Here are three directions that changed how I thought about the problem.
Situation Awareness
On my team, we have hundreds of dashboards: Kubernetes clusters, Borg jobs, storage metrics, VM utilization, network metrics, etc. Each one tells part of the story.
When something went wrong and I wanted to understand the current state of the system, I had to spend a significant amount of time opening tabs and cross-referencing panels to build a complete picture.
This is a fundamentally human bottleneck. Each dashboard was designed to answer a specific question. The question “What is the current situation?” doesn’t map to any single dashboard, and navigating all of them to reconstruct an answer takes time we often don’t have.
Interestingly, this is where AI can change the equation. Instead of navigating dashboards, imagine describing your system to an AI agent with access to your observability stack and simply asking: “What’s going on?” The agent queries across your telemetry data, picks out what stands out, and hands you back a coherent narrative, something you can actually act on. For example: “This cluster has had an issue affecting every container that uses distributed storage on that node for the past two hours.”
This shifts your role from navigator (opening dashboards one by one) to interpreter (acting on a synthesized summary). And that shift matters: every minute you spend navigating is a minute you're not spending on the actual problem.
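To make the shape of this concrete, here’s a minimal sketch of the pattern. Everything in it is hypothetical: collect_signals stands in for whatever your metrics backend exposes, and the output is just a prompt you’d hand to the model of your choice, not a real agent framework.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    source: str   # e.g. "k8s", "storage", "network"
    query: str    # what we asked that source
    summary: str  # compact, text-friendly rendering of the result

def collect_signals(window: str) -> list[Signal]:
    """Fan out a few broad queries to every telemetry source.

    Stubbed here; a real version would call your metrics backend and
    reduce each result to a short textual summary.
    """
    queries = {
        "k8s": "pod restarts and pending pods per cluster",
        "storage": "p99 read/write latency per volume",
        "network": "packet loss and retransmits per zone",
    }
    return [Signal(src, q, f"<top outliers over {window} for: {q}>")
            for src, q in queries.items()]

def situation_report(window: str = "1h") -> str:
    """Assemble one prompt that asks the model 'what is going on?'."""
    evidence = "\n".join(
        f"[{s.source}] {s.query}: {s.summary}" for s in collect_signals(window)
    )
    return ("You are assisting an SRE. Given the telemetry below, describe "
            "what stands out and what a responder should look at first.\n"
            + evidence)

print(situation_report())
```

The design choice that matters is the reduction step: each source is compressed into a short textual summary so the entire system state fits into one prompt, which is what lets the model answer “what’s going on?” across sources instead of per dashboard.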
Telemetry Archaeology
A few months ago, I was investigating a storage incident on a cluster. The failure itself was clear: a disk issue that surfaced as elevated latency and, eventually, service degradation. What wasn’t clear was why it happened when it did.
I used Gemini CLI to navigate the metrics data around the event window. What it surfaced surprised me: the root cause signals had been present in the telemetry hours before the incident triggered any alert, hiding in subtle correlations across metrics that individually looked like noise. Disk read latency was creeping slightly upward, I/O wait was ticking up on specific nodes, and a minor memory pressure pattern was emerging. Together, they pointed directly at the failure that was coming.
A human reviewing those dashboards in real time would almost certainly have missed it. Each individual signal was within an acceptable range. The pattern only became visible when we looked at all of them together, across time.
This is what I’d call telemetry archaeology: using AI to go back through your metrics data and surface the correlations an alerting system wasn’t designed to catch.
It’s worth being precise about what makes this different from anomaly detection. Anomaly detection tells you when something looks wrong. Telemetry archaeology is about finding the patterns that appear before anything looks wrong at all, relationships that no one thought to encode into an alert, because no one knew they existed until the incident happened.
The practical implication is significant. If these correlations exist in your past incidents, they likely exist in future ones. An AI agent that continuously monitors for these multi-signal patterns could surface a warning (“This looks like the early stages of what happened last time”) long before your system starts showing symptoms.
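To illustrate what “individually noise, collectively signal” can look like in code, here’s a toy sketch (not what Gemini CLI does internally): it standardizes each metric with z-scores and flags windows where several metrics are mildly elevated at once, even though none of them crosses a typical alert threshold. All names and thresholds are illustrative.

```python
import statistics

def zscores(series: list[float]) -> list[float]:
    """Standardize a metric so different units become comparable."""
    mean = statistics.fmean(series)
    stdev = statistics.pstdev(series) or 1.0  # avoid division by zero
    return [(x - mean) / stdev for x in series]

def quiet_cooccurrence(metrics: dict[str, list[float]],
                       mild: float = 1.0, loud: float = 3.0,
                       min_signals: int = 3) -> list[int]:
    """Flag timestamps where several metrics drift upward together.

    Each metric alone stays below a typical alert threshold (`loud`),
    so nothing pages, but `min_signals` of them being mildly elevated
    (`mild`) at the same time is the pre-incident shape we care about.
    """
    standardized = {name: zscores(vals) for name, vals in metrics.items()}
    n = min(len(v) for v in standardized.values())
    flagged = []
    for t in range(n):
        elevated = [name for name, z in standardized.items()
                    if mild <= z[t] < loud]
        if len(elevated) >= min_signals:
            flagged.append(t)
    return flagged

# Toy data shaped like the incident: each signal stays "within an
# acceptable range" on its own, but they creep upward together.
metrics = {
    "disk_read_latency_ms": [5, 5, 6, 5, 6, 7, 8, 8],
    "io_wait_pct":          [2, 2, 2, 3, 3, 4, 5, 5],
    "mem_pressure":         [10, 11, 10, 11, 12, 13, 14, 15],
}
print(quiet_cooccurrence(metrics))  # -> [6, 7]
```

On the toy data, no single series would page anyone, but the last two samples are flagged because all three metrics drift upward together, which is exactly the pattern described above.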
Incident Co-Pilot
Active incidents can be cognitively brutal. You can be debugging a live system, managing communication with stakeholders, coordinating with other engineers, and trying to remember what you checked 20 minutes ago, all at the same time.
A common consequence is that the engineer with the deepest system knowledge gets pulled out of deep focus to write status updates, summarize what’s been tried, and maintain a running timeline. This work is necessary, but it’s expensive. Every context switch makes it harder to hold the full mental model of the incident in your head. And once that model fragments, rebuilding it takes time you don’t have.
NOTE: This is actually one of the reasons Google developed the IMAG (Incident Management at Google) process, with clear role separation: the Incident Commander (IC) coordinates the overall response, the Communications Lead (CL) handles stakeholder updates, and the Operations Lead (OL) focuses on mitigating the issue. The explicit goal is to prevent any single person from being pulled in too many directions at once.
AI can absorb most of this overhead. Think of it as a second brain that’s been in the room the whole time: it tracks what hypotheses have been tested, which ones were ruled out and why, what changed in the system during the incident window, and what hasn’t been explored yet. When a new engineer joins the investigation, instead of spending ten minutes getting them up to speed, you ask the AI for a summary.
AI’s role here is handling the administrative layer of the incident, the parts that pull you out of flow, so you can stay focused on the problem itself.
I’ve been using AI this way during my own shifts. Even without a purpose-built tool, maintaining a running log with AI (e.g., what we’ve tried, what we know, what’s next) noticeably changes how an incident feels.
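For anyone who wants to try the same thing, here’s a minimal sketch of the structure I mean by a running log. The classes and names are mine, not part of any tool: the point is that hypotheses and timeline entries stay machine-readable, so an AI (or a human joining mid-incident) can be briefed from one artifact instead of a scroll through chat history.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class Status(Enum):
    OPEN = "open"
    RULED_OUT = "ruled out"
    CONFIRMED = "confirmed"

@dataclass
class Hypothesis:
    text: str
    status: Status = Status.OPEN
    evidence: str = ""

@dataclass
class IncidentLog:
    events: list[str] = field(default_factory=list)
    hypotheses: list[Hypothesis] = field(default_factory=list)

    def note(self, what: str) -> None:
        """Append a timestamped entry to the running timeline."""
        ts = datetime.now(timezone.utc).strftime("%H:%M")
        self.events.append(f"{ts} {what}")

    def rule_out(self, text: str, why: str) -> None:
        """Record a hypothesis that was tested and eliminated, and why."""
        self.hypotheses.append(Hypothesis(text, Status.RULED_OUT, why))

    def briefing(self) -> str:
        """Everything a newly joining responder needs, in one paste.

        In practice, I hand this to the model and ask for a prose
        summary; keeping the raw structure means nothing gets lost
        in paraphrase.
        """
        lines = ["Timeline:"] + self.events + ["Hypotheses:"]
        lines += [f"- {h.text} [{h.status.value}] {h.evidence}"
                  for h in self.hypotheses]
        return "\n".join(lines)

log = IncidentLog()
log.note("p99 latency alert fired on cluster-a")
log.rule_out("bad deploy", "no rollouts in the incident window")
log.note("storage team paged; disk errors surfacing on one node")
print(log.briefing())
```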
Summary
The common “AI for production” narrative focuses on automation and replacement; cognitive augmentation is the underexplored angle.
Situation awareness: AI can synthesize across hundreds of dashboards to answer “What’s the current situation?” in seconds, shifting your role from navigator to interpreter.
Telemetry archaeology: AI can surface hidden correlations across metrics that individually look like noise, revealing root cause signals that were present hours before any alert fired.
Incident co-pilot: AI can absorb the administrative layer of an active incident (status updates, running timeline, hypothesis tracking), keeping the engineer in deep focus instead of constant context switching.
None of this requires replacing the engineer. The value is in extending what one person can hold in their head under pressure.