Nine in ten engineers now report using AI tools daily. Most engineering leaders evaluate the investment based on Copilot acceptance rates and self-reported speed, and both are trending upward.
AI pays off across the SDLC, but the returns are unevenly distributed. The only randomized controlled trial of experienced developers on familiar codebases measured a 19% slowdown, even as those developers believed they were 20% faster. The DORA report confirms the pattern: throughput gains are now positive, but instability persists.
Most mid-market AI budgets sit in coding tools, while the phases with the strongest measured payoffs draw almost no investment.
Testing and observability deliver the clearest AI returns today
The strongest AI returns in the SDLC come from two phases most teams haven’t tried yet.
Meta’s TestGen-LLM is the best-documented AI testing deployment at an industrial scale. Nearly three-quarters of its generated test recommendations were accepted into production during Instagram and Facebook test-a-thons, and more than one in ten classes it touched saw measurable coverage improvements.
A study confirmed the pattern from a different angle, using reinforcement learning to prioritize test execution so that the first failing test surfaced within the top 16% of the suite. Both approaches converge on the same principle: score components by defect history and change frequency, then focus testing effort where it matters most.
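Neither team has published its full pipeline, but the shared scoring idea is simple enough to sketch. The Python pass below mines git history for change frequency and bug-fix density per file; the keyword heuristic, weights, and function names are illustrative assumptions to tune locally, not either team’s method.

```python
# Illustrative risk scoring over git history; the weights and the
# bug-keyword heuristic are assumptions, not either paper's method.
import subprocess
from collections import Counter

def commit_counts(repo: str) -> tuple[Counter, Counter]:
    """Count total commits and bug-fix commits per file."""
    log = subprocess.run(
        ["git", "-C", repo, "log", "--name-only", "--pretty=format:MSG:%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    changes, fixes = Counter(), Counter()
    is_fix = False
    for line in log.splitlines():
        if line.startswith("MSG:"):
            # Crude heuristic: flag commits whose subject mentions a fix.
            is_fix = any(w in line.lower() for w in ("fix", "bug", "defect"))
        elif line.strip():
            changes[line] += 1
            if is_fix:
                fixes[line] += 1
    return changes, fixes

def risk_ranking(repo: str, w_change: float = 0.4, w_defect: float = 0.6):
    """Blend normalized change frequency and defect history into one score."""
    changes, fixes = commit_counts(repo)
    max_c = max(changes.values(), default=1)
    max_f = max(fixes.values(), default=1)
    scored = [(path, w_change * changes[path] / max_c + w_defect * fixes[path] / max_f)
              for path in changes]
    return sorted(scored, key=lambda t: t[1], reverse=True)

if __name__ == "__main__":
    for path, score in risk_ranking(".")[:10]:
        print(f"{score:.2f}  {path}")
```

A ranking like this is also where AI test generation earns its keep first: point the generator at the top of the list.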
On the operations side, Gartner projects AIOps adopters will cut mean time to resolution (MTTR) by up to 40% by 2027. The DORA report adds an important qualifier: AI amplifies existing operational strengths and dysfunctions in equal measure.
What makes both phases safer first investments than coding tools is the blast radius. A bad AI test gets caught by the CI gate. A bad AI alert gets triaged by the on-call engineer. The failure mode is noise, not production defects.
What you can do
Pick one test-heavy codebase. Deploy AI test generation with acceptance gates: tests must build, pass reliably, and increase coverage. If testing ROI proves out, AIOps on one high-alert-volume production service is the natural second move.
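Those three gates are easy to automate. Below is a minimal sketch assuming pytest and coverage.py; the baseline figure, rerun count, and report path are placeholders for your pipeline’s values, not a drop-in CI job.

```python
# Minimal acceptance gate for AI-generated tests: a sketch assuming
# pytest and coverage.py, with placeholder threshold values.
import json
import subprocess
import sys

BASELINE_COVERAGE = 78.4   # % line coverage before the generated tests (placeholder)
RELIABILITY_RUNS = 5       # consecutive clean runs required (flake check)

def main(new_test_paths: list[str]) -> None:
    # Gates 1 and 2: the tests must import (build) and pass, repeatedly.
    for i in range(RELIABILITY_RUNS):
        result = subprocess.run(["pytest", "-q", *new_test_paths])
        if result.returncode != 0:
            sys.exit(f"Rejected: run {i + 1}/{RELIABILITY_RUNS} failed")

    # Gate 3: coverage must move, not just hold steady.
    subprocess.run(["coverage", "run", "-m", "pytest", "-q"], check=True)
    subprocess.run(["coverage", "json", "-o", "coverage.json"], check=True)
    with open("coverage.json") as f:
        new_coverage = json.load(f)["totals"]["percent_covered"]

    if new_coverage <= BASELINE_COVERAGE:
        sys.exit(f"Rejected: coverage {new_coverage:.1f}% <= baseline {BASELINE_COVERAGE:.1f}%")
    print(f"Accepted: coverage {BASELINE_COVERAGE:.1f}% -> {new_coverage:.1f}%")

if __name__ == "__main__":
    main(sys.argv[1:])
```

Run it as the last step of the generation job so a flaky or coverage-neutral test never reaches human review.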
Requirements and legacy discovery deliver the highest leverage
AI’s per-hour value is highest in the phases that sit before coding.
A peer-reviewed study tested AI-generated acceptance criteria against independent domain expert judgment. More than 80% of the AI-generated criteria were judged as relevant additions to existing user stories.
In structured pilots we have observed on complex-domain projects, BA and QA pairs who loaded domain context into AI tools and generated acceptance criteria reported shorter rework cycles and better edge-case coverage than manual analysis alone.
Defects caught at requirements cost up to 100 times less to fix than those that reach production, which makes every improvement here a multiplier for everything downstream.
McKinsey’s LegacyX program reports a 40-50% acceleration in modernization timelines. In one case, a 20,000-line COBOL migration estimated at 700 to 800 hours of manual effort saw that effort cut by 40%, with AI agents handling discovery and mapping. AI improves the discovery phase that precedes a rewrite; it does not fix the rewrite itself.
What you can do
Pilot AI-assisted requirements on your highest-rework feature area. For legacy code more than three years old whose original engineers are gone, start with a discovery-only AI scoping engagement to map dependencies and surface undocumented business logic before committing to a full modernization program. NeuVantage codifies this discovery into a structured modernization assessment.
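A first-pass dependency map is cheap to produce before any engagement. The sketch below walks a COBOL source tree and extracts static CALL edges; it is illustrative only, the file extension and layout are assumptions, and it deliberately ignores dynamic CALLs, copybooks, and JCL, which a real discovery effort (and NeuVantage’s assessment) must cover.

```python
# Illustrative static dependency mapping for a COBOL codebase.
# File extension and directory layout are assumptions to adjust.
import re
from collections import defaultdict
from pathlib import Path

CALL_PATTERN = re.compile(r"\bCALL\s+'([A-Z0-9-]+)'", re.IGNORECASE)

def map_dependencies(root: str) -> dict[str, set[str]]:
    """Return caller -> callees edges from static CALL statements."""
    deps: dict[str, set[str]] = defaultdict(set)
    for source in Path(root).rglob("*.cbl"):   # assumed extension
        text = source.read_text(errors="ignore")
        for callee in CALL_PATTERN.findall(text):
            deps[source.stem.upper()].add(callee.upper())
    return deps

if __name__ == "__main__":
    for program, callees in sorted(map_dependencies("src/").items()):
        print(f"{program} -> {', '.join(sorted(callees))}")
```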
Architecture decisions need more caution than any other phase
Architecture is a trade-off analysis. AI is pattern completion. The gap between those two tasks is structural, not a tool limitation.
A peer-reviewed analysis published in TMLR provides the clearest evidence that LLMs can articulate correct architectural principles but do not reliably apply them. The authors call this “computational split-brain syndrome,” where the model understands the trade-off in theory but cannot execute the reasoning that resolves it.
InfoQ’s architect community reached a similar conclusion. AI can suggest alternatives when given sufficient context, but it cannot make decisions.
A large-scale analysis of AI-generated code found that roughly a quarter to a third of the output contained exploitable security weaknesses, which is why security architecture choices in particular should stay with humans.
The less visible risk is the gap between AI-generated code and the team’s understanding of the system it runs. When AI writes faster than engineers build mental models, architectural review becomes reconstruction rather than judgment.
What you can do
Use AI to draft architecture decision records, conduct prior-art research, and write documentation. Exclude it from trade-off decisions, novel component design, and security architecture choices.
Sequencing drives more value than switching tools
Most teams invested in coding tools first and are now planning to expand. Gartner finds that teams applying AI only to coding capture roughly 10% productivity gains, while teams deploying across the full SDLC are projected to capture 25 to 30% by 2028.
Bain’s report explains that coding accounts for only 25-35% of the time from idea to product launch. A tool that accelerates a quarter of the timeline has a ceiling.
For teams with less platform discipline than Microsoft or Google, a coding-first rollout risks yielding throughput gains at the expense of downstream instability.
What you can do
Before expanding your coding-tool rollout, instrument downstream effects such as code churn, PR size, review time, and change-failure rate. If quality signals worsen after 90 days, redirect the budget to testing or observability first.
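Two of those four signals can be baselined in an afternoon. The sketch below pulls merged PRs from the GitHub REST API and reports median PR size and time to merge; the org, repo, and token variable are placeholders, and code churn and change-failure rate need commit and deployment data this pass does not touch.

```python
# Sketch: baseline PR size and time to merge from the GitHub REST API.
# OWNER, REPO, and the token env var are placeholders.
import os
import statistics
from datetime import datetime

import requests

OWNER, REPO = "your-org", "your-repo"
API = f"https://api.github.com/repos/{OWNER}/{REPO}/pulls"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

def merged_prs(pages: int = 3):
    """Yield recently closed PRs that were actually merged."""
    for page in range(1, pages + 1):
        resp = requests.get(API, headers=HEADERS,
                            params={"state": "closed", "per_page": 100, "page": page})
        resp.raise_for_status()
        yield from (pr for pr in resp.json() if pr.get("merged_at"))

sizes, review_hours = [], []
for pr in merged_prs():
    # The list endpoint omits line counts, so fetch each PR individually.
    detail = requests.get(pr["url"], headers=HEADERS).json()
    sizes.append(detail["additions"] + detail["deletions"])
    opened = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
    merged = datetime.fromisoformat(pr["merged_at"].replace("Z", "+00:00"))
    review_hours.append((merged - opened).total_seconds() / 3600)

print(f"median PR size: {statistics.median(sizes):.0f} lines changed")
print(f"median time to merge: {statistics.median(review_hours):.1f} hours")
```

Capture the same numbers again at day 90; if PR size climbs and merge time stretches, that is the redirect signal.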
PexAI provides the operating framework for this sequencing, with standardized blueprints that govern how AI integrates across each phase. Explore where to start.