When an agent gives a wrong answer, the instinct is to blame the model. Switch providers. Fine-tune. Add more guardrails. That response feels productive, but it usually misses the actual cause.

Gartner’s survey of 782 infrastructure and operations leaders found that only 28% of AI use cases fully succeed and meet ROI expectations. Among those who saw setbacks, 38% cited poor data quality or limited data availability as the direct cause. RAND’s study reached a similar finding: more than four in five AI projects fail, with data engineering identified as the primary root cause.

The failure sits in what the agent sees, not what the agent thinks: the wrong data is retrieved, stale context persists across workflow steps, and agents access information without permission boundaries that match the requesting user.

These are context infrastructure problems, and swapping models will not fix them.

Your agent’s benchmark score won’t survive your actual schemas

Most agent evaluations run on clean, public datasets. The model scores well, the team moves forward, and nobody tests how it performs against the company’s actual schemas, where tables number in the hundreds, column names are ambiguous, and join logic reflects years of business decisions no training set has seen.

The BEAVER benchmark ran exactly that test. Built from two real enterprise data warehouses with up to 366 tables and 2,708 columns, it measured how leading LLMs handle enterprise-scale data.

GPT-4o and Llama-3-70B achieved near-zero end-to-end execution accuracy. Average recall dropped 48.7 points compared to the Spider public benchmark, and execution accuracy fell 64.4% even when the correct tables were provided upfront.

Enterprise join logic and column semantics are simply not in the training data. The model cannot locate or interpret the right tables, regardless of how capable its reasoning is.

What you can do

Before swapping models, instrument the failing call. Trace what the agent saw: which documents it retrieved, which tables it queried, and what context it carried from prior steps.

If the retrieved data was wrong or stale, no model change will help. The model becomes a candidate for replacement only when the correct context is verifiably in the prompt, and the agent still gets it wrong. Until a trace confirms that, model-switching is a misdiagnosis.
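The triage above can be sketched in a few lines. This is a minimal illustration, not a tracing product: the `AgentTrace` and `RetrievedItem` records, the `diagnose` function, the `relevant` flag (a reviewer’s judgment recorded after the fact), and the one-day freshness threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class RetrievedItem:
    source_id: str            # document or table identifier (hypothetical)
    last_validated: datetime  # when the source was last checked
    relevant: bool            # reviewer's judgment: was this the right data?


@dataclass
class AgentTrace:
    question: str
    retrieved: list[RetrievedItem] = field(default_factory=list)
    carried_context: list[str] = field(default_factory=list)  # from prior steps
    answer_correct: bool = False


def diagnose(trace: AgentTrace, now: datetime, max_age: timedelta) -> str:
    """Classify a failed call as a context problem or a model candidate."""
    if trace.answer_correct:
        return "ok"
    if not trace.retrieved:
        return "context: nothing retrieved"
    if not any(item.relevant for item in trace.retrieved):
        return "context: wrong data retrieved"
    stale = [i.source_id for i in trace.retrieved
             if i.relevant and now - i.last_validated > max_age]
    if stale:
        return "context: stale sources " + ", ".join(stale)
    # Correct, fresh context was in the prompt and the answer was still wrong.
    return "model: candidate for replacement"


# Example: the right table was retrieved, but it was last validated 30 days ago.
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
trace = AgentTrace(
    question="What is the Q4 refund total?",
    retrieved=[RetrievedItem("finance.refunds_v1", now - timedelta(days=30), True)],
)
print(diagnose(trace, now, timedelta(days=1)))
# prints: context: stale sources finance.refunds_v1
```

The ordering of the checks is the point: the model is only named as a suspect after every context explanation has been ruled out.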

Why more context in the prompt makes your AI agent worse

The intuitive fix for retrieval problems is to push more information into the context window. Models now accept hundreds of thousands of tokens, and the assumption is that more input means better output.

Stanford’s “Lost in the Middle” study found a 30%+ drop in accuracy when the relevant document moved from position 1 to position 10. Performance fell below the model’s own closed-book baseline when the answer sat in the middle of the input.

The model did better knowing nothing than knowing too much in the wrong order.

Datadog’s report adds production data: across more than a thousand customer environments, 69% of all input tokens in agent traces were system prompts, meaning internal instructions, policy definitions, and tool guidance.

The question is whether that context is curated or accumulated.

What you can do

Pull a trace from your last ten agent calls and measure what share of the input is system instructions versus task-relevant data.

Treat the context window as a budget. Trim what does not contribute, load only the policies relevant to each task, and review context spend the way you review compute spend.
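A rough way to run that measurement is sketched below. It assumes each trace can be split into labeled segments. The category names, the `context_shares` and `over_budget` helpers, and the 50% cap are hypothetical, and the whitespace word count is only a stand-in for your model’s real tokenizer.

```python
from collections import Counter


def rough_token_count(text: str) -> int:
    """Crude whitespace-based estimate; swap in your model's real tokenizer."""
    return len(text.split())


def context_shares(segments: list[tuple[str, str]]) -> dict[str, float]:
    """Return each category's share of total input tokens.

    `segments` is a list of (category, text) pairs taken from one trace.
    """
    totals = Counter()
    for category, text in segments:
        totals[category] += rough_token_count(text)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {category: count / grand_total for category, count in totals.items()}


def over_budget(shares: dict[str, float], budget: dict[str, float]) -> list[str]:
    """List categories whose share of the input exceeds the cap set for them."""
    return [c for c, cap in budget.items() if shares.get(c, 0.0) > cap]


# Example trace split into system instructions and task-relevant data.
segments = [
    ("system_instructions",
     "You are a support agent. Follow the refund policy. Use tools carefully."),
    ("task_data", "Order 1182 was refunded on 3 March."),
]
shares = context_shares(segments)
for category, share in shares.items():
    print(f"{category}: {share:.0%}")
print("over budget:", over_budget(shares, {"system_instructions": 0.5}))
```

Running this across the last ten calls gives a baseline. The cap itself is a judgment call for each team, not a figure from the studies cited above.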

Fix the context layer before you swap the model

The difference between a model failure and a context failure lies in where the damage occurs and what the fix requires.

In July 2025, a coding agent deleted a production database containing over a thousand executive records during a code freeze, then generated approximately 4,000 fake records to mask the loss.

The CEO publicly confirmed the remediation: dev/prod separation, improved rollback, and restricted write access. Every fix targeted the infrastructure around the agent. The model was never changed.

A JMIR Cancer study directly measured the retrieval variable. GPT-4 hallucinated 6% of the time when grounded in general search results and 0% when grounded in a curated knowledge base.

The retrieval source moved the failure rate to zero. The model stayed the same. What the agent can see and what it can access determine whether it fails gracefully or dangerously.

What to ask before your next agent goes live

For any agent that touches financial data, customer records, or production systems, three questions should have answers before launch. What sources does it retrieve from, and when were those sources last validated? What can the agent write to, and are production environments fully separated from development? And when two sources disagree, which one wins?

The third question matters because if two sources on your own systems disagree on a policy or a number, the agent will act on whichever it retrieves first, with equal confidence.
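One way to avoid acting on whichever source happens to be retrieved first is to detect disagreement explicitly and resolve it with a declared precedence order. The sketch below assumes retrieved content has already been reduced to simple key/value facts. The `Fact` record, the source names, and the precedence list are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    key: str        # e.g. "refund_window_days"
    value: str
    source_id: str  # where the value was retrieved from


def find_conflicts(facts: list[Fact]) -> dict[str, list[Fact]]:
    """Group facts by key and keep only the keys whose sources disagree."""
    by_key: dict[str, list[Fact]] = {}
    for fact in facts:
        by_key.setdefault(fact.key, []).append(fact)
    return {k: fs for k, fs in by_key.items() if len({f.value for f in fs}) > 1}


def resolve(conflict: list[Fact], precedence: list[str]) -> Fact:
    """Pick the fact from the highest-precedence source; fail loudly otherwise."""
    for source_id in precedence:
        for fact in conflict:
            if fact.source_id == source_id:
                return fact
    raise ValueError("no authoritative source for " + conflict[0].key)


# Example: two internal sources disagree on the refund window.
facts = [
    Fact("refund_window_days", "14", "legacy_faq"),
    Fact("refund_window_days", "30", "policy_wiki"),
]
conflicts = find_conflicts(facts)
winner = resolve(conflicts["refund_window_days"], ["policy_wiki", "legacy_faq"])
print(winner.value, "from", winner.source_id)
# prints: 30 from policy_wiki
```

Raising an error when no listed source applies is deliberate in this sketch: an unresolved conflict should stop the agent rather than let retrieval order decide.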

What a production-grade context layer looks like for AI agents

Teams with three to five data engineers cannot build a custom retrieval pipeline for every agent.

The realistic path is a thin, governed context layer: one managed retrieval store pointed at two or three high-value internal sources, identity passed through so the agent only retrieves what the requesting user is authorized to see, and observability instrumented so when an agent fails, the first diagnostic is what it saw.
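The identity pass-through part of that layer can be sketched as a permission filter that runs before any matching. This is an illustration under stated assumptions: the `Document` and `User` records, group-based access lists, and the keyword matcher are hypothetical stand-ins for a managed retrieval store and a real identity provider.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("context_layer")


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # groups permitted to read this document


@dataclass(frozen=True)
class User:
    user_id: str
    groups: frozenset[str]


def retrieve_for_user(query: str, user: User, store: list[Document]) -> list[Document]:
    """Return matching documents the requesting user is authorized to see.

    The permission check runs before matching, so unauthorized text never
    reaches the prompt. Keyword matching stands in for a real retriever.
    """
    terms = query.lower().split()
    visible = [d for d in store if d.allowed_groups & user.groups]
    hits = [d for d in visible if any(t in d.text.lower() for t in terms)]
    # Record what the agent saw, keyed by user, for later diagnosis.
    logger.info("user=%s query=%r saw=%s", user.user_id, query,
                [d.doc_id for d in hits])
    return hits


# Example: a support user asks about refunds and salaries.
store = [
    Document("hr-001", "Salary bands for 2025", frozenset({"hr"})),
    Document("kb-014", "Refund policy: 30 days", frozenset({"support", "hr"})),
]
support_user = User("u-42", frozenset({"support"}))
hits = retrieve_for_user("refund salary", support_user, store)
print([d.doc_id for d in hits])
# prints: ['kb-014']
```

The HR document matches the query but is filtered out because the user is not in an allowed group, and the log line captures the first diagnostic the paragraph above calls for: what the agent saw.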

Microsoft’s Foundry Agent Service now mandates per-agent identity and inherited user permissions.

The Model Context Protocol standardizes agent-to-tool connections, replacing bespoke connectors that become maintenance liabilities at small-team scale.

The production readiness test

Can you trace every agent output back to a source document and a user identity?

If not, the agent is not production-grade. That benchmark does not require a large platform team. It requires a context layer that is governed, observable, and scoped to what each user is authorized to see.
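The readiness test can be expressed as a simple check over logged outputs. The sketch below assumes each output is stored with the identity it ran under and the documents it was grounded in. The `AgentOutput` record and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentOutput:
    output_id: str
    user_id: Optional[str]       # identity the request ran under
    source_doc_ids: list[str]    # documents the answer was grounded in


def untraceable(outputs: list[AgentOutput]) -> list[str]:
    """Return IDs of outputs missing a user identity or any source document."""
    return [o.output_id for o in outputs if not o.user_id or not o.source_doc_ids]


def is_production_grade(outputs: list[AgentOutput]) -> bool:
    """Pass only if every output traces to a user and at least one source."""
    return not untraceable(outputs)


# Example: the second output has no source, the third has no user identity.
outputs = [
    AgentOutput("out-1", "u-42", ["kb-014"]),
    AgentOutput("out-2", "u-42", []),
    AgentOutput("out-3", None, ["kb-014"]),
]
print(untraceable(outputs))
# prints: ['out-2', 'out-3']
print(is_production_grade(outputs))
# prints: False
```

A check like this only proves the trace fields exist. Whether the recorded source actually supports the answer still needs review.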

Anthropic, Microsoft, and LangChain have each codified this work under the term context engineering.

The CTO’s question is shifting from “which model should we use” to “what does this agent need to see, and what should it never see.”

If you are evaluating how to build a governed context layer for your agent workflows, here’s how we approach it.

Hiren is CTO at Simform with extensive experience helping enterprises and startups streamline their business performance through data-driven innovation.
