Some Thoughts on Harness Engineering
I first came across the concept of Harness Engineering back in April. My first reaction was that it was probably just another buzzword coined by the AI community. But after spending the past two days digging into OpenCode's source code, I realized there is real substance behind it.
When AI agents first entered mainstream discussion, the prevailing view was that Prompt Engineering, Skills, and Agent Frameworks were merely transitional products of insufficient model capability, and that if models kept evolving over the next 3–5 years, these auxiliary methods would eventually disappear. I found this view convincing at the time, and I saw plenty of supporting evidence. In the early days, people tirelessly shared ways of writing prompts, and even packaged and sold them across platforms: "You are a professional translator and must follow XXX principles," or "You are an expert in the field of XX." With the leap in model capability, especially the generational progress after GPT-4, even a vague, poorly worded, or grammatically flawed prompt could still elicit an accurate response.
But strangely, from GPT-3.5's sudden rise in 2022 to today's fragmented landscape of top-tier models, with GPT-5.5, Claude 4.7 Opus, and Gemini 3.1 Pro each occupying a major position, model capability has multiplied many times over. We have watched AI evolve from being unable to tell which is larger, 9.11 or 9.9, to independently shipping a complex front-end/back-end separated system with minimal human inspection of the code. Yet over that same period, agent frameworks and tools have not disappeared; they have grown explosively. Major vendors are all promoting their own MCP (Model Context Protocol) integrations, and a single Skill plugin can easily rack up thousands of stars on GitHub.
I still believe that the "temporary patches" that rely on Markdown documents and prompt instructions to constrain LLM behavior will sooner or later exit the stage of history. "Think step by step," once widely promoted as a way to improve reasoning, is rarely mentioned now that models with native reasoning capabilities are widespread. Yet through this same evolution, Harness Engineering has stood out.
So, what problem exactly is Harness Engineering solving?
As I understand it, a harness is not meant to compensate for the shortcomings of LLMs; it takes on the parts that models should not be responsible for in the first place.
Since 2025, AI's programming ability has, at least in theory, surpassed that of humans. But in real production, AI can still write code with bizarre logical bugs, and its output still needs human review. This is exactly where Harness Engineering comes into play. By analogy: a highly capable boss can certainly grow a company, but once the company reaches a certain scale, it cannot function without a proper management system. Harness Engineering is precisely that "management system."
In summary, Harness Engineering will not disappear as LLM capabilities improve. On the contrary, as models grow stronger, it will become indispensable in real production scenarios. No matter how powerful AI models become, Query Loops, Context Compaction, Permission Boundaries, and Validation Mechanisms will never disappear. They are the cornerstones that let AI capability land reliably in practice.