The Hidden Dangers of AI Agents in Production
As AI agents become increasingly prevalent in production environments, a new category of incidents is emerging that engineering teams are not equipped to track or handle.

A growing number of production incidents are caused by AI agents, yet they fit no existing postmortem template. They occur when an agent takes an action that is correct given the context it can see, but that context is incomplete, and the action sets off a cascade of failures. The exposure is significant: 79% of organizations now have some form of AI agent in production, and 96% plan to expand their use. Gartner predicts that 33% of enterprise software will include agentic AI by 2028, but warns that 40% of those projects will be canceled due to poor risk controls.
The problem is that autonomous agents and chaos engineering are treated as separate disciplines. They are not. They are the same discipline, and the gap between them is quietly generating the next wave of major production incidents. When a human engineer initiates a chaos experiment, they make a judgment call about whether the system has the capacity to absorb the perturbation. When an autonomous remediation agent is introduced, that question disappears. The agent sees an anomaly and acts, with no human judgment about whether the timing is right.
One observed failure mode: a remediation agent detects elevated latency on a microservice and responds by restarting the service cluster. The agent doesn't know that three other services are handling peak traffic, that the shared connection pool is already at 87% utilization, and that a dependent database is running a background index rebuild. The restart triggers a cascade of failures that the agent was never designed to model. Engineering teams currently have no way to track or handle this type of incident.
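The missing context in that scenario can be made explicit as a pre-action check. The following is a minimal sketch, not a description of any real agent framework: the `SystemContext` fields, the `restart_blockers` function, and the 80% pool threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SystemContext:
    """Hypothetical snapshot of the surroundings a restart would affect."""
    peers_at_peak: int            # neighboring services currently at peak traffic
    pool_utilization: float       # shared connection pool usage, 0.0 to 1.0
    db_maintenance_running: bool  # e.g. a background index rebuild


def restart_blockers(ctx, max_pool_utilization=0.80, max_peers_at_peak=0):
    """Return the reasons a restart should be deferred (empty list = proceed).

    The default thresholds are illustrative assumptions, not recommendations.
    """
    blockers = []
    if ctx.peers_at_peak > max_peers_at_peak:
        blockers.append(f"{ctx.peers_at_peak} peer services at peak traffic")
    if ctx.pool_utilization >= max_pool_utilization:
        blockers.append(f"connection pool at {ctx.pool_utilization:.0%}")
    if ctx.db_maintenance_running:
        blockers.append("database maintenance in progress")
    return blockers


# The scenario described above: every one of the three conditions is present.
incident = SystemContext(peers_at_peak=3, pool_utilization=0.87,
                         db_maintenance_running=True)
blockers = restart_blockers(incident)
if blockers:
    print("Deferring restart:", "; ".join(blockers))
else:
    print("Restart permitted")
```

A static check like this only covers conditions someone thought to list in advance, which is the same limitation the article attributes to static thresholds.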
The underlying problem is that enterprise systems have no shared language for absorb capacity: the real-time estimate of how much additional stress a system can take before it breaches its SLO commitments. Chaos engineering programs manage it implicitly, through human judgment and static thresholds. Agents don't manage it at all. A resilience budget model has been developed that treats absorb capacity as a continuously recomputed, consumable resource. The model draws on four live signal classes: SLO burn rate, P99 latency trend, dependency saturation state, and application behavioral signals.
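One way to picture a budget of this kind is sketched below. The article does not give the model's formula, so everything here is an assumption: the `Signals` fields, the normalization constants, the choice to take the minimum headroom across the four signals, and the `ResilienceBudget` class are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical readings for the four signal classes."""
    slo_burn_rate: float          # 1.0 = error budget consumed at the sustainable rate
    p99_trend: float              # fractional P99 latency change, 0.10 = +10%
    dependency_saturation: float  # worst dependency, 0.0 to 1.0
    app_error_ratio: float        # stand-in for application behavioral signals


def _clamp(x):
    return max(0.0, min(1.0, x))


def absorb_capacity(s):
    """Estimate remaining headroom in [0, 1]; the tightest signal wins.

    The divisors are assumed "no headroom left" points, not measured values.
    """
    headroom = [
        _clamp(1.0 - s.slo_burn_rate / 2.0),     # assume burn rate 2.0 exhausts headroom
        _clamp(1.0 - s.p99_trend / 0.5),         # assume +50% P99 exhausts headroom
        _clamp(1.0 - s.dependency_saturation),
        _clamp(1.0 - s.app_error_ratio / 0.05),  # assume 5% errors exhausts headroom
    ]
    return min(headroom)


class ResilienceBudget:
    """Treats absorb capacity as a consumable resource shared by all actors."""

    def __init__(self):
        self.reserved = 0.0

    def try_reserve(self, signals, cost):
        # Capacity is recomputed from live signals on every request.
        available = absorb_capacity(signals) - self.reserved
        if cost > available:
            return False
        self.reserved += cost
        return True

    def release(self, cost):
        self.reserved = max(0.0, self.reserved - cost)


budget = ResilienceBudget()
stressed = Signals(slo_burn_rate=1.0, p99_trend=0.10,
                   dependency_saturation=0.87, app_error_ratio=0.01)
# A restart assumed to cost 0.3 is refused: saturation leaves about 0.13 headroom.
print(budget.try_reserve(stressed, cost=0.3))
```

Because chaos experiments and agent remediations would both draw from the same `ResilienceBudget` in this sketch, neither can consume headroom the other has already reserved.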
Large language models (LLMs) are also being explored as a way to generate chaos hypotheses from dependency graphs and incident postmortem corpora. The results are directionally useful, but limited. The models are only as good as the data they are trained on, and they can propose experiments with incorrect blast radius assumptions. Postmortems are the more reliable source of information, because they describe failures that actually occurred in the system at a specific moment in time.
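A proposed experiment's blast radius claim can be checked mechanically against the dependency graph before anyone runs it. The sketch below assumes the graph is available as a mapping from each service to the services that depend on it; the service names and the `underestimated_services` helper are hypothetical.

```python
from collections import deque


def blast_radius(dependents, target):
    """Return every service that transitively depends on `target` (breadth-first)."""
    seen = set()
    queue = deque([target])
    while queue:
        service = queue.popleft()
        for dependent in dependents.get(service, ()):
            if dependent != target and dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def underestimated_services(dependents, target, claimed):
    """Services the graph says are affected but the hypothesis did not list."""
    return blast_radius(dependents, target) - set(claimed)


# Hypothetical graph: key = service, value = services that depend on it.
dependents = {
    "db": ["orders", "inventory"],
    "orders": ["checkout"],
    "checkout": ["web"],
}

# A generated hypothesis that only lists the direct dependents of "db".
missing = underestimated_services(dependents, "db", claimed={"orders", "inventory"})
print(sorted(missing))
```

A check like this only catches errors relative to the graph itself; if the graph is stale or incomplete, the computed blast radius is too.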
Source: VentureBeat