Self-improving Coding Agents

Source: DEV Community
In this article, I demonstrate how a more capable coding agent (Codex/GPT-5.4) refined the prompt for a less capable agent (GitHub Copilot CLI/GPT-5-mini) using a custom evaluation tool called Katt. This "Test-Driven Agentic Workflow" (TDAW) increased passing evals from 0/3 to 3/3, though it also significantly increased runtime and token usage. It shows that an agent can improve a prompt's effectiveness both by clarifying instructions and by fixing the evaluation process itself.

Why this article now?

I realized I haven't been very vocal about my AI research into coding agents and their ability to produce accurate outputs, even with low-capability models. I've spent the last 3 years experimenting with ML and AI, and 1.5 years with coding agents, using my engineering skills to build different harness strategies on top of them. With this recent experiment, I felt it was time to be more outspoken and share some of my learnings and journey with AI. So here co
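To make the TDAW idea concrete, here is a minimal sketch of an eval-driven prompt-refinement loop. This is not Katt's actual interface (the article doesn't show it); `run_evals`, `refine_loop`, and the case format are hypothetical stand-ins for "run the evals, hand the failures to the stronger agent, rerun until everything passes or we give up."

```python
# Hypothetical sketch of a test-driven agentic workflow (TDAW) loop.
# Katt's real API is not shown here; these names are illustrative only.

def run_evals(prompt, cases):
    """Return the subset of eval cases the prompt currently passes."""
    return [c for c in cases if c["check"](prompt)]

def refine_loop(prompt, cases, refine, max_rounds=5):
    """Rerun evals after each refinement until all pass or rounds run out.

    `refine` stands in for the stronger agent: it takes the current
    prompt plus the failing cases and returns an improved prompt.
    """
    for _ in range(max_rounds):
        passed = run_evals(prompt, cases)
        if len(passed) == len(cases):
            return prompt, len(passed)
        failed = [c for c in cases if c not in passed]
        prompt = refine(prompt, failed)
    return prompt, len(run_evals(prompt, cases))
```

In the article's experiment the stronger model (GPT-5.4) plays the `refine` role, and each round costs extra runtime and tokens, which is where the 0/3 → 3/3 improvement traded off against cost.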