Volume 17, Number 1
From Benchmark Scores to Deployment Readiness: A Repository-scale Evaluation Framework for Autonomous Software Development Agents
Authors
Partha Sarathi Samal, Suresh Kumar Palus, and Sai Kiran Padmam, Independent Researcher, USA
Abstract
Autonomous software engineering agents are progressing from code assistants to systems that can plan and execute multi-step repository changes. This shift requires evaluation methods that move beyond one-shot pass-rate reporting. Many current studies still provide limited visibility into repeatability, failure recovery, and operational efficiency under realistic constraints. This journal article presents an expanded DevAgentBench and DevAgentEval methodology designed for deployment-oriented assessment. The benchmark covers bug fixing, test generation, refactoring, code review assistance, and long-horizon feature work. We organize the analysis into three metric layers: task-level success and correctness, robustness under perturbation, and business-aligned operational efficiency. We also formalize a nine-category failure-mode taxonomy linked to trace-level evidence and remediation guidance. Baseline experiments across agent patterns and model families show that rankings are sensitive to context reduction, tool-output noise, transient execution failures, and tighter resource budgets. These findings indicate that average success rates alone are insufficient for production decisions. We therefore recommend condition-aware reporting, repeated-run variance estimation, and reproducible artifact release as minimum standards for autonomous software-agent benchmarking.
Keywords
Autonomous software agents, agentic AI, software engineering benchmarks, repository-scale evaluation, reliability analysis, robustness testing, failure taxonomy, bug fixing, test generation, code review automation, refactoring, long-horizon software tasks, tool-use evaluation, operational metrics, cost-aware evaluation, deployment readiness
