When an agent gives a wrong answer, the instinct is to blame the model. Switch providers. Fine-tune. Add more guardrails. That response feels productive, but it usually misses the actual cause.
Gartner’s survey of 782 infrastructure and operations leaders found that only 28% of AI use cases fully succeed and meet ROI expectations. Among those who saw setbacks, 38% cited poor data quality or limited data availability as the direct cause. RAND’s study reached a similar finding: more than four in five AI projects fail, with data engineering identified as the primary root cause.
The failure sits in what the agent sees, not what the agent thinks. The wrong data is retrieved, stale context persists across workflow steps, or the agent accesses information without permission boundaries that match the requesting user.
These are context infrastructure problems, and swapping models will not fix them.
Your agent’s benchmark score won’t survive your actual schemas
Most agent evaluations run on clean, public datasets. The model scores well, the team moves forward, and nobody tests how it performs against the company’s actual schemas, where tables number in the hundreds, column names are ambiguous, and join logic reflects years of business decisions no training set has seen.
The BEAVER benchmark ran exactly that test. Built from two real enterprise data warehouses with up to 366 tables and 2,708 columns, it measured how leading LLMs handle enterprise-scale data.
GPT-4o and Llama-3-70B achieved near-zero end-to-end execution accuracy. Average recall dropped 48.7 points compared to the Spider public benchmark, and execution accuracy fell 64.4% even when the correct tables were provided upfront.
Enterprise join logic and column semantics are simply not in the training data. The model cannot locate or interpret the right tables, regardless of how capable its reasoning is.
What you can do
Before swapping models, instrument the failing call. Trace what the agent saw: which documents it retrieved, which tables it queried, and what context it carried from prior steps.
If the retrieved data was wrong or stale, no model change will help. The model becomes a candidate for replacement only when the correct context is verifiably in the prompt and the agent still gets it wrong. Until a trace confirms that, model-switching is a misdiagnosis.
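The triage described above can be sketched in a few lines. This is a minimal illustration, not a tracing product: `RetrievedItem`, `AgentTrace`, and `diagnose` are hypothetical names, the `relevant` flag is assumed to come from a human reviewer, and the freshness threshold is whatever your data warrants.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class RetrievedItem:
    source_id: str        # document or table the agent pulled (hypothetical field)
    fetched_at: datetime  # when that source was last refreshed
    relevant: bool        # reviewer's judgement: does it answer the task?


@dataclass
class AgentTrace:
    question: str
    retrieved: list[RetrievedItem] = field(default_factory=list)
    answer_correct: bool = False


def diagnose(trace: AgentTrace, now: datetime, max_age: timedelta) -> str:
    """Classify a call as fine, a context failure, or a model candidate."""
    if trace.answer_correct:
        return "ok"
    relevant = [item for item in trace.retrieved if item.relevant]
    if not relevant:
        return "context: wrong or missing data retrieved"
    # Stale only if every relevant item is older than the allowed age.
    if all(now - item.fetched_at > max_age for item in relevant):
        return "context: stale data"
    return "model: correct context was present, answer still wrong"
```

Only the last branch points at the model. Every other outcome sends you back to retrieval or freshness.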
Why more context in the prompt makes your AI agent worse
The intuitive fix for retrieval problems is to push more information into the context window. Models now accept hundreds of thousands of tokens, and the assumption is that more input means better output.
Stanford’s “Lost in the Middle” study found a 30%+ drop in accuracy when the relevant document moved from position 1 to position 10. Performance fell below the model’s own closed-book baseline when the answer sat in the middle of the input.
The model did better knowing nothing than knowing too much in the wrong order.
Datadog’s report adds production data. Across more than a thousand customer environments, 69% of all input tokens in agent traces were system prompts: internal instructions, policy definitions, and tool guidance.
The question is whether that context is curated or accumulated.
What you can do
Pull a trace from your last ten agent calls and measure what share of the input is system instructions versus task-relevant data.
Treat the context window as a budget. Trim what does not contribute, load only the policies relevant to each task, and review context spend the way you review compute spend.
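A rough way to run that measurement is sketched below. It assumes each trace can be exported as `(role, text)` segments, and it counts whitespace-separated words as a crude stand-in for tokens; a real audit would use your model's tokenizer. `context_share` and `select_policies` are illustrative names, not part of any tracing tool.

```python
from collections import defaultdict


def context_share(segments: list[tuple[str, str]]) -> dict[str, float]:
    """Return each role's share of the input, by approximate token count."""
    totals: dict[str, int] = defaultdict(int)
    for role, text in segments:
        totals[role] += len(text.split())  # word count as a token proxy
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {role: count / grand_total for role, count in totals.items()}


def select_policies(policies: dict[str, str], task_tags: set[str]) -> list[str]:
    """Load only the policy texts whose tag matches the current task."""
    return [text for tag, text in policies.items() if tag in task_tags]
```

For example, `context_share([("system", "a b c d e f"), ("task", "x y")])` reports a 0.75 share for system instructions and 0.25 for task data.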
Fix the context layer before you swap the model
The difference between a model failure and a context failure lies in where the damage occurs and what the fix requires.
In July 2025, a coding agent deleted a production database containing over a thousand executive records during a code freeze, then generated approximately 4,000 fake records to mask the loss.
The CEO publicly confirmed the remediation: dev/prod separation, improved rollback, and restricted write access. Every fix targeted the infrastructure around the agent. The model was never changed.
A JMIR Cancer study directly measured the retrieval variable. GPT-4 hallucinated 6% of the time when grounded in general search results and 0% when grounded in a curated knowledge base.
The retrieval source moved the failure rate to zero. The model stayed the same. What the agent can see and what it can access determine whether it fails gracefully or dangerously.
What to ask before your next agent goes live
For any agent that touches financial data, customer records, or production systems, three questions should have answers before launch. What sources does it retrieve from, and when were those sources last validated? What can the agent write to, and are production environments fully separated from development? And when two sources disagree, which one wins?
That last question matters because if two sources on your own systems disagree on a policy or a number, the agent will act on whichever it retrieves first, with equal confidence.
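One pre-launch check is to scan for keys on which your sources disagree before the agent ever sees them. The sketch below assumes facts can be flattened into `(source, key, value)` tuples; `find_conflicts` and the sample source names are hypothetical.

```python
def find_conflicts(
    records: list[tuple[str, str, object]],
) -> dict[str, dict[str, object]]:
    """Return keys whose values differ across sources, with each source's value."""
    by_key: dict[str, dict[str, object]] = {}
    for source, key, value in records:
        by_key.setdefault(key, {})[source] = value
    return {
        key: values
        for key, values in by_key.items()
        if len(set(values.values())) > 1  # more than one distinct value
    }


records = [
    ("wiki", "refund_days", 30),
    ("crm", "refund_days", 14),
    ("wiki", "tier", "gold"),
    ("crm", "tier", "gold"),
]
conflicts = find_conflicts(records)
```

Here `conflicts` contains only `refund_days`, the one key where the two sources give different numbers. Resolving that disagreement is a data decision to make before launch.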
What a production-grade context layer looks like for AI agents
Teams with three to five data engineers cannot build a custom retrieval pipeline for every agent.
The realistic path is a thin, governed context layer: one managed retrieval store pointed at two or three high-value internal sources, identity passed through so the agent only retrieves what the requesting user is authorized to see, and observability instrumented so when an agent fails, the first diagnostic is what it saw.
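In its simplest form, that layer filters by identity before it searches and logs what it returned. The sketch below is an assumption-heavy illustration: `Document`, `retrieve_for_user`, and the group-based permission model are hypothetical, and a plain substring match stands in for a real search index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # groups permitted to read this document


def retrieve_for_user(
    query: str,
    user_groups: frozenset[str],
    store: list[Document],
    audit_log: list[dict],
) -> list[Document]:
    """Search only documents the requesting user may see, and record the result."""
    # Apply the permission filter first, so unauthorized text never reaches the prompt.
    visible = [doc for doc in store if doc.allowed_groups & user_groups]
    hits = [doc for doc in visible if query.lower() in doc.text.lower()]
    # Observability: keep a record of what the agent saw for this identity.
    audit_log.append(
        {
            "query": query,
            "groups": sorted(user_groups),
            "returned": [doc.doc_id for doc in hits],
        }
    )
    return hits
```

Filtering before ranking is the design choice that matters here. A filter applied after retrieval still exposes restricted content to the retrieval step and its logs.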
Microsoft’s Foundry Agent Service now mandates per-agent identity and inherited user permissions.
The Model Context Protocol standardizes agent-to-tool connections, replacing bespoke connectors that become maintenance liabilities at small-team scale.
The production readiness test
Can you trace every agent output back to a source document and a user identity?
If not, the agent is not production-grade. That benchmark does not require a large platform team. It requires a context layer that is governed, observable, and scoped to what each user is authorized to see.
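That test can be expressed as a check over logged outputs. The sketch below assumes each output is stored with the source document IDs it cited and the user it was produced for; `AgentOutput` and `is_traceable` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentOutput:
    text: str
    source_doc_ids: list[str] = field(default_factory=list)
    user_id: Optional[str] = None


def is_traceable(output: AgentOutput, known_doc_ids: set[str]) -> bool:
    """True only if the output names a user and cites at least one known source."""
    if not output.user_id:
        return False
    if not output.source_doc_ids:
        return False
    return all(doc_id in known_doc_ids for doc_id in output.source_doc_ids)


def untraceable(outputs: list[AgentOutput], known_doc_ids: set[str]) -> list[AgentOutput]:
    """Return the outputs that fail the production readiness test."""
    return [out for out in outputs if not is_traceable(out, known_doc_ids)]
```

An empty result from `untraceable` over a sample of recent outputs is a reasonable bar to set before calling an agent production-grade.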
Anthropic, Microsoft, and LangChain have each codified this work under the term context engineering.
The CTO’s question is shifting from “which model should we use” to “what does this agent need to see, and what should it never see.”
If you are evaluating how to build a governed context layer for your agent workflows, here’s how we approach it.