mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-06-16 08:26:52 +08:00
Merge pull request #2220 from lamenting-hawthorn/feat/agent-self-evaluation
feat(skills,agents): add agent-self-evaluation skill and agent-evaluator persona. Catalog counts reconciled.
commit 1705cb72f0
@ -11,7 +11,7 @@
|
||||
{
|
||||
"name": "ecc",
|
||||
"source": "./",
|
||||
"description": "Harness-native ECC operator layer - 65 agents, 262 skills, 84 legacy command shims, reusable hooks, rules, selective install profiles, and production-ready workflows for Claude Code, Codex, OpenCode, Cursor, and related agent harnesses",
|
||||
"description": "Harness-native ECC operator layer - 66 agents, 264 skills, 84 legacy command shims, reusable hooks, rules, selective install profiles, and production-ready workflows for Claude Code, Codex, OpenCode, Cursor, and related agent harnesses",
|
||||
"version": "2.0.0",
|
||||
"author": {
|
||||
"name": "Affaan Mustafa",
|
||||
|
||||
@ -1,7 +1,7 @@
|
||||
{
|
||||
"name": "ecc",
|
||||
"version": "2.0.0",
|
||||
"description": "Harness-native ECC plugin for engineering teams - 65 agents, 262 skills, 84 legacy command shims, reusable hooks, rules, MCP conventions, and operator workflows for Claude Code plus adjacent agent harnesses",
|
||||
"description": "Harness-native ECC plugin for engineering teams - 66 agents, 264 skills, 84 legacy command shims, reusable hooks, rules, MCP conventions, and operator workflows for Claude Code plus adjacent agent harnesses",
|
||||
"author": {
|
||||
"name": "Affaan Mustafa",
|
||||
"url": "https://x.com/affaanmustafa"
|
||||
|
||||
@ -1,6 +1,6 @@
|
||||
# Everything Claude Code (ECC) — Agent Instructions
|
||||
|
||||
This is a **production-ready AI coding plugin** providing 65 specialized agents, 262 skills, 84 commands, and automated hook workflows for software development.
|
||||
This is a **production-ready AI coding plugin** providing 66 specialized agents, 264 skills, 84 commands, and automated hook workflows for software development.
|
||||
|
||||
**Version:** 2.0.0
|
||||
|
||||
@ -151,8 +151,8 @@ Troubleshoot failures: check test isolation → verify mocks → fix implementat
|
||||
## Project Structure
|
||||
|
||||
```
|
||||
agents/ — 65 specialized subagents
|
||||
skills/ — 262 workflow skills and domain knowledge
|
||||
agents/ — 66 specialized subagents
|
||||
skills/ — 264 workflow skills and domain knowledge
|
||||
commands/ — 84 slash commands
|
||||
hooks/ — Trigger-based automations
|
||||
rules/ — Always-follow guidelines (common + per-language)
|
||||
|
||||
14
README.md
@ -157,7 +157,7 @@ Stable graduation of the 2.0 line: 261 skills, the control-pane substrate (sessi
|
||||
### v2.0.0-rc.1 — Surface Refresh, Operator Workflows, and ECC 2.0 Alpha (Apr 2026)
|
||||
|
||||
- **Dashboard GUI** — New Tkinter-based desktop application (`ecc_dashboard.py` or `npm run dashboard`) with dark/light theme toggle, font customization, and project logo in header and taskbar.
|
||||
- **Public surface synced to the live repo** — metadata, catalog counts, plugin manifests, and install-facing docs now match the actual OSS surface: 65 agents, 262 skills, and 84 legacy command shims.
|
||||
- **Public surface synced to the live repo** — metadata, catalog counts, plugin manifests, and install-facing docs now match the actual OSS surface: 66 agents, 264 skills, and 84 legacy command shims.
|
||||
- **Operator and outbound workflow expansion** — `brand-voice`, `social-graph-ranker`, `connections-optimizer`, `customer-billing-ops`, `ecc-tools-cost-audit`, `google-workspace-ops`, `project-flow-ops`, and `workspace-surface-audit` round out the operator lane.
|
||||
- **Media and launch tooling** — `manim-video`, `remotion-video-creation`, and upgraded social publishing surfaces make technical explainers and launch content part of the same system.
|
||||
- **Framework and product surface growth** — `nestjs-patterns`, richer Codex/OpenCode install surfaces, and expanded cross-harness packaging keep the repo usable beyond Claude Code alone.
|
||||
@ -428,7 +428,7 @@ If you stacked methods, clean up in this order:
|
||||
/plugin list ecc@ecc
|
||||
```
|
||||
|
||||
**That's it!** You now have access to 65 agents, 262 skills, and 84 legacy command shims.
|
||||
**That's it!** You now have access to 66 agents, 264 skills, and 84 legacy command shims.
|
||||
|
||||
### Dashboard GUI
|
||||
|
||||
@ -558,7 +558,7 @@ ECC/
|
||||
| |-- plugin.json # Plugin metadata and component paths
|
||||
| |-- marketplace.json # Marketplace catalog for /plugin marketplace add
|
||||
|
|
||||
|-- agents/ # 65 specialized subagents for delegation
|
||||
|-- agents/ # 66 specialized subagents for delegation
|
||||
| |-- planner.md # Feature implementation planning
|
||||
| |-- architect.md # System design decisions
|
||||
| |-- tdd-guide.md # Test-driven development
|
||||
@ -1515,9 +1515,9 @@ The configuration is automatically detected from `.opencode/opencode.json`.
|
||||
|
||||
| Feature | Claude Code | OpenCode | Status |
|
||||
|---------|---------------------|----------|--------|
|
||||
| Agents | PASS: 65 agents | PASS: 12 agents | **Claude Code leads** |
|
||||
| Agents | PASS: 66 agents | PASS: 12 agents | **Claude Code leads** |
|
||||
| Commands | PASS: 84 commands | PASS: 35 commands | **Claude Code leads** |
|
||||
| Skills | PASS: 262 skills | PASS: 37 skills | **Claude Code leads** |
|
||||
| Skills | PASS: 264 skills | PASS: 37 skills | **Claude Code leads** |
|
||||
| Hooks | PASS: 8 event types | PASS: 11 events | **OpenCode has more!** |
|
||||
| Rules | PASS: 29 rules | PASS: 13 instructions | **Claude Code leads** |
|
||||
| MCP Servers | PASS: 14 servers | PASS: Full | **Full parity** |
|
||||
@ -1676,9 +1676,9 @@ ECC is the **first plugin to maximize every major AI coding tool**. Here's how e
|
||||
|
||||
| Feature | Claude Code | Cursor IDE | Codex CLI | OpenCode | GitHub Copilot |
|
||||
|---------|-----------------------|------------|-----------|----------|----------------|
|
||||
| **Agents** | 65 | Shared (AGENTS.md) | Shared (AGENTS.md) | 12 | N/A |
|
||||
| **Agents** | 66 | Shared (AGENTS.md) | Shared (AGENTS.md) | 12 | N/A |
|
||||
| **Commands** | 84 | Shared | Instruction-based | 35 | 5 prompts |
|
||||
| **Skills** | 262 | Shared | 10 (native format) | 37 | Via instructions |
|
||||
| **Skills** | 264 | Shared | 10 (native format) | 37 | Via instructions |
|
||||
| **Hook Events** | 8 types | 15 types | None yet | 11 types | None |
|
||||
| **Hook Scripts** | 20+ scripts | 16 scripts (DRY adapter) | N/A | Plugin hooks | N/A |
|
||||
| **Rules** | 34 (common + lang) | 34 (YAML frontmatter) | Instruction-based | 13 instructions | 1 always-on file |
|
||||
|
||||
@ -164,7 +164,7 @@ Copy-Item -Recurse rules/typescript "$HOME/.claude/rules/"
|
||||
/plugin list ecc@ecc
|
||||
```
|
||||
|
||||
**Done!** You can now use 65 agents, 262 skills, and 84 commands.
|
||||
**Done!** You can now use 66 agents, 264 skills, and 84 commands.
|
||||
|
||||
### multi-* commands require additional configuration
|
||||
|
||||
|
||||
206
agents/agent-evaluator.md
Normal file
@ -0,0 +1,206 @@
|
||||
---
|
||||
name: agent-evaluator
|
||||
description: Evaluates agent output against a 5-axis quality rubric (accuracy, completeness, clarity, actionability, conciseness). Use after any non-trivial task when the user wants a quality assessment, or when the agent-self-evaluation skill is active. Produces a structured scorecard with evidence and improvement suggestions.
|
||||
tools: ["Read", "Grep", "Glob", "Bash"]
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
You are a quality evaluator for AI agent output. Your job is to assess agent responses against structured criteria, not to perform the original task.
|
||||
|
||||
## Your Role
|
||||
|
||||
- Score agent output on 5 axes: Accuracy, Completeness, Clarity, Actionability, Conciseness
|
||||
- Every score below 5 MUST cite specific evidence from the output
|
||||
- Provide concrete, actionable improvement suggestions
|
||||
- Maintain objectivity — evaluate the output, not the agent's effort or intent
|
||||
- Read `skills/agent-self-evaluation/SKILL.md` for the detailed scoring rubric. Example input is a standard ECC `SKILL.md` file with YAML frontmatter and Markdown sections such as `## When to Activate`, `## Core Concepts`, and `## Best Practices`.
|
||||
|
||||
- DO NOT re-perform the original task
|
||||
- DO NOT suggest alternative approaches unless the current approach is factually wrong
|
||||
- DO NOT assign a score of 5 without citing evidence of correctness
|
||||
- DO NOT penalize for missing features the user didn't request
|
||||
|
||||
### Bash Tool Constraints
|
||||
|
||||
The `Bash` tool is granted for read-only verification only. Allowed: `grep`, `cat`, `ls`, `find`, `head`, `tail`, `wc`, `stat`. Allowed with hardening: `git --no-pager log`, `git --no-pager diff`, `git --no-pager show` (`--no-pager` is a global git option, so it goes before the subcommand; prefer also passing `-c core.pager=cat` so that a pager configured in a repo-local `.git/config` cannot be used to execute code). Forbidden: `rm`, `mv`, `chmod`, `git push`, `git commit`, `dd`, `mkfs`, `sudo`, `npm install`, `pip install`, `curl … | sh`, `wget … | sh`, or any command that writes, deletes, or modifies files, or pushes to remotes. If a verification requires a forbidden command, state the intent and expected effects and ask the user for explicit confirmation before running it.
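
The policy above could also be checked mechanically before a command runs. The following is a minimal sketch, not part of ECC: `is_read_only` is a hypothetical helper, it assumes the command arrives as a single shell string, and it deliberately rejects anything containing pipes, redirects, or substitutions instead of trying to parse them.

```python
import shlex

# Commands listed above as allowed for read-only verification.
ALLOWED = {"grep", "cat", "ls", "find", "head", "tail", "wc", "stat"}
# Git subcommands allowed only when the pager is disabled.
ALLOWED_GIT = {"log", "diff", "show"}
# `find` actions that run programs or write files (assumed list, not exhaustive).
FIND_UNSAFE = {"-delete", "-exec", "-execdir", "-ok", "-okdir",
               "-fprint", "-fprint0", "-fprintf", "-fls"}
# Shell metacharacters that could chain, pipe, substitute, or redirect.
SHELL_META = set("|;&><`$")


def is_read_only(command: str) -> bool:
    """Return True only if `command` fits the read-only allowlist (hypothetical helper)."""
    if any(ch in SHELL_META for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return False
    if not argv:
        return False
    if argv[0] == "find":
        return not any(arg in FIND_UNSAFE for arg in argv[1:])
    if argv[0] in ALLOWED:
        return True
    if argv[0] == "git":
        # `--no-pager` is a global git option, so it precedes the subcommand.
        return len(argv) >= 3 and argv[1] == "--no-pager" and argv[2] in ALLOWED_GIT
    return False
```

The check is conservative by design: a legitimate `grep` pattern containing `$` or `|` is rejected too, which in this sketch means falling back to asking the user.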
|
||||
|
||||
## Workflow
|
||||
|
||||
### Step 1: Understand the Task
|
||||
|
||||
Read the user's original request and the agent's final output. Identify:
|
||||
- What was explicitly asked for
|
||||
- What was implicitly expected (standard practices, edge cases)
|
||||
- What the agent claimed to deliver
|
||||
|
||||
### Step 2: Gather Evidence
|
||||
|
||||
Use tools to verify claims:
|
||||
- Run `grep` to confirm API names, function signatures, file paths
|
||||
- Check test output for pass/fail status
|
||||
- Verify that files the agent claims to have created actually exist
|
||||
- Cross-reference claims against project conventions (check existing files for patterns)
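
As one illustration of the evidence-gathering step, the file-existence check can be sketched in a few lines. The helper name `missing_files` and its parameters are assumptions made for this example; the evaluator itself would normally use `ls` or `stat` through the `Bash` tool.

```python
from pathlib import Path


def missing_files(claimed: list[str], root: str = ".") -> list[str]:
    """Return the claimed paths that do not exist as regular files under `root`."""
    base = Path(root)
    return [path for path in claimed if not (base / path).is_file()]
```

An empty result supports the agent's claim; any returned path is concrete evidence for lowering the Accuracy score.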
|
||||
|
||||
### Step 3: Score Each Axis
|
||||
|
||||
Work through the 5 axes from the `agent-self-evaluation` skill:
|
||||
|
||||
1. **Accuracy** — Are claims correct? Grep the codebase to verify.
|
||||
2. **Completeness** — All requirements covered? List what's there and what's missing.
|
||||
3. **Clarity** — Well-structured? Check for headings, code blocks, summaries.
|
||||
4. **Actionability** — Can the user act immediately? Is there a PR, a command, a file?
|
||||
5. **Conciseness** — No fluff? Check for redundancy, filler, meta-commentary.
|
||||
|
||||
For each axis:
|
||||
- Assign score 1-5
|
||||
- If score < 5, cite the specific gap with evidence (line numbers, grep output, file existence)
|
||||
- Write a one-sentence improvement
|
||||
|
||||
### Step 4: Produce Report
|
||||
|
||||
Use this exact format (matches `scripts/evaluate.py` output):
|
||||
|
||||
```
|
||||
============================================================
|
||||
AGENT SELF-EVALUATION REPORT
|
||||
============================================================
|
||||
Summary: Overall score X.X/5 across 5 quality axes.
|
||||
|
||||
Accuracy █████ 5/5
|
||||
+ [Evidence: passing tests, verified claims] (no → when score = 5)
|
||||
|
||||
Completeness ████░ 4/5
|
||||
+ [What's covered]
|
||||
→ [Improvement: only shown when score < 5]
|
||||
|
||||
Clarity █████ 5/5
|
||||
+ [Structure signals] (no → when score = 5)
|
||||
|
||||
Actionability █████ 5/5
|
||||
+ [User can act immediately] (no → when score = 5)
|
||||
|
||||
Conciseness █████ 5/5
|
||||
+ [Information density] (no → when score = 5)
|
||||
|
||||
OVERALL X.X/5
|
||||
|
||||
CRITICAL ISSUES (axes ≤ 2):
|
||||
[Axis] Score N/5 — specific fix needed
|
||||
(or "None" if no axis ≤ 2)
|
||||
|
||||
Self-check: Would the user agree with this assessment? [Yes/No + brief justification]
|
||||
|
||||
TOP IMPROVEMENTS:
|
||||
1. [Highest impact fix]
|
||||
2. [Second highest]
|
||||
|
||||
VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
|
||||
```
|
||||
|
||||
## Output Format
|
||||
|
||||
Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT".
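
For orientation, the scorecard lines and the critical-issues list could be produced along these lines. This is a sketch only: the contents of `scripts/evaluate.py` are not shown here, so `render_axis` and `critical_issues` are hypothetical names, and the single-space column layout is an assumption.

```python
def render_axis(name: str, score: int, width: int = 5) -> str:
    """Render one scorecard line: axis name, filled/empty bar, then score/width."""
    if not 1 <= score <= width:
        raise ValueError(f"score must be between 1 and {width}, got {score}")
    bar = "█" * score + "░" * (width - score)
    return f"{name} {bar} {score}/{width}"


def critical_issues(scores: dict[str, int]) -> list[str]:
    """Names of axes at or below the critical threshold (score <= 2), in input order."""
    return [axis for axis, score in scores.items() if score <= 2]
```

With the scores from the weak example in this file, `critical_issues` selects Accuracy and Actionability, the same two axes listed under CRITICAL ISSUES there.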
|
||||
|
||||
## Examples
|
||||
|
||||
### Example: Strong Output
|
||||
|
||||
Task: Add retry logic to HTTP client. 3 retries, exponential backoff.
|
||||
|
||||
```
|
||||
============================================================
|
||||
AGENT SELF-EVALUATION REPORT
|
||||
============================================================
|
||||
Summary: Overall score 4.6/5 across 5 quality axes.
|
||||
|
||||
Accuracy █████ 5/5
|
||||
+ Tests passing
|
||||
+ grep confirms httpx transport configured correctly
|
||||
+ Import verified
|
||||
|
||||
Completeness ████░ 4/5
|
||||
+ All HTTP methods covered
|
||||
+ Edge cases documented
|
||||
→ Missing: connection pool exhaustion handling (minor edge case)
|
||||
|
||||
Clarity █████ 5/5
|
||||
+ Uses headings for structure
|
||||
+ Summary in first 3 lines
|
||||
+ Code blocks with language tags
|
||||
|
||||
Actionability █████ 5/5
|
||||
+ PR #423 created
|
||||
+ pytest -v cited (42 passed)
|
||||
+ Single action: merge PR
|
||||
|
||||
Conciseness ████░ 4/5
|
||||
+ 250 words, high density
|
||||
→ Verification section slightly verbose — 3 commands could be 1 script
|
||||
|
||||
OVERALL 4.6/5
|
||||
|
||||
CRITICAL ISSUES (axes ≤ 2):
|
||||
None
|
||||
|
||||
Self-check: Would the user agree with this assessment? Yes — the scores cite passing tests, grep verification, and the remaining gaps are minor.
|
||||
|
||||
TOP IMPROVEMENTS:
|
||||
1. [Completeness] Add connection pool exhaustion to edge cases doc
|
||||
2. [Conciseness] Consolidate verification commands into a single script
|
||||
|
||||
VERDICT: Deliver as-is. Minor improvements noted above.
|
||||
```
|
||||
|
||||
### Example: Weak Output
|
||||
|
||||
Task: Same as above.
|
||||
|
||||
```
|
||||
============================================================
|
||||
AGENT SELF-EVALUATION REPORT
|
||||
============================================================
|
||||
Summary: Overall score 2.8/5 across 5 quality axes.
|
||||
|
||||
Accuracy ██░░░ 2/5
|
||||
+ Code block present
|
||||
- Hedged claim without verification ("I think this should work")
|
||||
- Explicitly untested
|
||||
- Speculation without evidence
|
||||
→ Cite specific tool outputs (test results, exit codes, grep findings)
|
||||
|
||||
Completeness ███░░ 3/5
|
||||
+ Provides code example
|
||||
- Explicit gap acknowledged ("might be edge cases with POST")
|
||||
- Limited scope noted (only 5xx, missing 429 and connection errors)
|
||||
→ List what's covered AND what's intentionally excluded
|
||||
|
||||
Clarity ████░ 4/5
|
||||
+ Uses code blocks
|
||||
- No integration guidance ("add this somewhere" is vague)
|
||||
→ Specify exact file and line where code should be added
|
||||
|
||||
Actionability ██░░░ 2/5
|
||||
- Defers work to user ("you'll want to test this")
|
||||
- Vague suggestion without specifics
|
||||
→ Create a PR with the changed file + tests
|
||||
|
||||
Conciseness ███░░ 3/5
|
||||
+ Short (120 words)
|
||||
- Low information density (~50% hedging/disclaimers)
|
||||
→ Cut meta-commentary and filler
|
||||
|
||||
OVERALL 2.8/5
|
||||
|
||||
CRITICAL ISSUES (axes ≤ 2):
|
||||
[Accuracy] Score 2/5 — Wrong library. Use httpx, not urllib3.
|
||||
[Actionability] Score 2/5 — No deliverable. Create a PR with test file.
|
||||
|
||||
Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable.
|
||||
|
||||
TOP IMPROVEMENTS:
|
||||
1. [Accuracy] Switch to httpx — grep the codebase first
|
||||
2. [Actionability] Create a PR with src/api_client.py + tests
|
||||
3. [Completeness] Handle 429, connection errors, and timeout
|
||||
|
||||
VERDICT: Redo with specific fixes. Weakest axis: Accuracy (2/5).
|
||||
```
|
||||
@ -1,6 +1,6 @@
|
||||
# Everything Claude Code (ECC) — Agent Instructions
|
||||
|
||||
This is a **production-ready AI coding plugin** providing 65 specialized agents, 262 skills, 84 commands, and automated hook workflows for software development.
|
||||
This is a **production-ready AI coding plugin** providing 66 specialized agents, 264 skills, 84 commands, and automated hook workflows for software development.
|
||||
|
||||
**Version:** 2.0.0
|
||||
|
||||
@ -146,8 +146,8 @@
|
||||
## Project Structure
|
||||
|
||||
```
|
||||
agents/ — 65 specialized subagents
|
||||
skills/ — 262 workflow skills and domain knowledge
|
||||
agents/ — 66 specialized subagents
|
||||
skills/ — 264 workflow skills and domain knowledge
|
||||
commands/ — 84 slash commands
|
||||
hooks/ — Trigger-based automations
|
||||
rules/ — Always-follow guidelines (common + per-language)
|
||||
|
||||
@ -228,7 +228,7 @@ Copy-Item -Recurse rules/typescript "$HOME/.claude/rules/"
|
||||
/plugin list ecc@ecc
|
||||
```
|
||||
|
||||
**All set!** You now have access to 65 agents, 262 skills, and 84 commands.
|
||||
**All set!** You now have access to 66 agents, 264 skills, and 84 commands.
|
||||
|
||||
***
|
||||
|
||||
@ -1140,9 +1140,9 @@ opencode
|
||||
|
||||
| Feature | Claude Code | OpenCode | Status |
|
||||
|---------|---------------|----------|--------|
|
||||
| Agents | PASS: 65 agents | PASS: 12 agents | **Claude Code leads** |
|
||||
| Agents | PASS: 66 agents | PASS: 12 agents | **Claude Code leads** |
|
||||
| Commands | PASS: 84 commands | PASS: 35 commands | **Claude Code leads** |
|
||||
| Skills | PASS: 262 skills | PASS: 37 skills | **Claude Code leads** |
|
||||
| Skills | PASS: 264 skills | PASS: 37 skills | **Claude Code leads** |
|
||||
| Hooks | PASS: 8 event types | PASS: 11 events | **OpenCode has more!** |
|
||||
| Rules | PASS: 29 rules | PASS: 13 instructions | **Claude Code leads** |
|
||||
| MCP Servers | PASS: 14 servers | PASS: Full | **Full parity** |
|
||||
@ -1248,9 +1248,9 @@ ECC 是**第一个最大化利用每个主要 AI 编码工具的插件**。以
|
||||
|
||||
| Feature | Claude Code | Cursor IDE | Codex CLI | OpenCode |
|
||||
|---------|-----------------------|------------|-----------|----------|
|
||||
| **Agents** | 65 | Shared (AGENTS.md) | Shared (AGENTS.md) | 12 |
|
||||
| **Agents** | 66 | Shared (AGENTS.md) | Shared (AGENTS.md) | 12 |
|
||||
| **Commands** | 84 | Shared | Instruction-based | 35 |
|
||||
| **Skills** | 262 | Shared | 10 (native format) | 37 |
|
||||
| **Skills** | 264 | Shared | 10 (native format) | 37 |
|
||||
| **Hook Events** | 8 types | 15 types | None yet | 11 types |
|
||||
| **Hook Scripts** | 20+ scripts | 16 scripts (DRY adapter) | N/A | Plugin hooks |
|
||||
| **Rules** | 34 (common + lang) | 34 (YAML frontmatter) | Instruction-based | 13 instructions |
|
||||
|
||||
182
skills/agent-self-evaluation/SKILL.md
Normal file
@ -0,0 +1,182 @@
|
||||
---
|
||||
name: agent-self-evaluation
|
||||
description: Use after completing any non-trivial task. The agent self-rates its output on 5 axes — accuracy, completeness, clarity, actionability, conciseness — with concrete evidence per criterion. Produces a structured 1-5 scorecard with specific improvement suggestions.
|
||||
origin: ECC
|
||||
---
|
||||
|
||||
# Agent Self-Evaluation
|
||||
|
||||
After completing a complex task, the agent pauses to rate its own output against a structured 5-axis rubric. This is NOT a pass/fail gate; it's a deliberate reflection step that catches omissions, flags overconfidence, and surfaces areas for improvement before the user has to.
|
||||
|
||||
## When to Activate
|
||||
|
||||
- After writing code that spans 3+ files or 50+ lines
|
||||
- After completing a multi-step workflow (implement → test → review)
|
||||
- After a debugging session that involved 3+ attempts
|
||||
- After producing a design document, architecture decision, or written analysis
|
||||
- When the user asks "how good was that?" or "rate yourself"
|
||||
- At the end of any session, via a Stop hook (if configured; see `references/hook-integration.md`)
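
The numeric triggers in this list can be expressed as a simple predicate. The sketch below is illustrative only: `TaskStats` and its field names are hypothetical, how an agent would obtain these counts is left open, and the Stop-hook trigger is omitted because it depends on hook configuration.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    """Hypothetical summary of a finished task; field names are illustrative."""
    files_changed: int = 0
    lines_changed: int = 0
    workflow_steps: int = 1
    debug_attempts: int = 0
    produced_document: bool = False
    user_requested: bool = False


def should_self_evaluate(stats: TaskStats) -> bool:
    """Return True if any activation trigger from the list above applies."""
    return (
        stats.files_changed >= 3        # code spanning 3+ files
        or stats.lines_changed >= 50    # or 50+ lines
        or stats.workflow_steps > 1     # multi-step workflow
        or stats.debug_attempts >= 3    # debugging with 3+ attempts
        or stats.produced_document      # design doc, decision, or analysis
        or stats.user_requested         # "rate yourself"
    )
```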
|
||||
|
||||
## Core Concepts
|
||||
|
||||
### The 5 Evaluation Axes
|
||||
|
||||
| Axis | Question | What it catches |
|
||||
|---|---|---|
|
||||
| **Accuracy** | Are the facts, claims, and outputs correct? | Hallucinations, wrong API names, incorrect syntax, false statements |
|
||||
| **Completeness** | Did it cover everything the user asked for? | Missed edge cases, unhandled error paths, forgotten requirements, skipped subtasks |
|
||||
| **Clarity** | Is the explanation understandable and well-structured? | Confusing explanations, jargon without definition, missing context, rambling |
|
||||
| **Actionability** | Can the user act on the output immediately? | Vague suggestions, missing steps, "you should X" without showing how, no verification path |
|
||||
| **Conciseness** | Did it use the minimum words/tokens needed? | Redundancy, over-explanation, repeating the user's question verbatim, filler content |
|
||||
|
||||
### Scoring Scale
|
||||
|
||||
```
|
||||
5 — Exceptional: no reasonable improvement possible
|
||||
4 — Good: minor nits only, no substantive gaps
|
||||
3 — Adequate: meets the request but has a notable weakness on at least one axis
|
||||
2 — Weak: has a clear gap that affects usability or correctness
|
||||
1 — Poor: fundamentally misses the request or contains significant errors
|
||||
```
|
||||
|
||||
### The Evidence Rule
|
||||
|
||||
Every score below 5 MUST cite specific evidence. A score of 3 cannot just say "could be better" — it must say exactly what is missing or wrong. The mantra: **"Show the gap, don't just name it."**
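
One way to make the Evidence Rule hard to skip is to reject a score at construction time when it lacks evidence. This is a sketch under stated assumptions: `AxisScore` is a hypothetical record type, and it checks only that some non-empty evidence string is present, not whether that evidence is actually specific.

```python
from dataclasses import dataclass, field


@dataclass
class AxisScore:
    """One axis of the scorecard; raises if the Evidence Rule is violated."""
    axis: str
    score: int
    evidence: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 1 <= self.score <= 5:
            raise ValueError(f"{self.axis}: score must be 1-5, got {self.score}")
        # Evidence Rule: any score below 5 must cite at least one specific gap.
        if self.score < 5 and not any(item.strip() for item in self.evidence):
            raise ValueError(f"{self.axis}: score {self.score} needs cited evidence")
```

For example, `AxisScore("Clarity", 3)` raises `ValueError`, while `AxisScore("Completeness", 4, ["Missing: timeout handling for hung connections"])` is accepted.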
|
||||
|
||||
## Workflow
|
||||
|
||||
### Step 1: Collect the Raw Material
|
||||
|
||||
Gather what you'll evaluate:
|
||||
|
||||
```
|
||||
- The original user request (read back from conversation)
|
||||
- Your final response/output (the deliverable)
|
||||
- Any tool outputs that verify correctness (test results, exit codes, lint output)
|
||||
- Any user feedback received during the task (corrections, "try again", "that's not right")
|
||||
```
|
||||
|
||||
### Step 2: Score Each Axis Independently
|
||||
|
||||
Work through the 5 axes one at a time. For each:
|
||||
|
||||
1. Read the axis question
|
||||
2. Find evidence (or lack of evidence) in the output
|
||||
3. Assign a score 1-5
|
||||
4. If score < 5, write a one-sentence improvement note citing the gap
|
||||
|
||||
Do NOT average the scores in your head first and then work backwards. Score each axis fresh.
|
||||
|
||||
### Step 3: Produce the Evaluation Report
|
||||
|
||||
Use the template from `templates/evaluation-report.md`. The report must include:
|
||||
|
||||
```
|
||||
- One-line summary
|
||||
- 5-axis scorecard (score + evidence per axis)
|
||||
- Overall score (simple average, rounded to 1 decimal)
|
||||
- 1-3 specific improvements ranked by impact
|
||||
- Self-check: "Would the user agree with this assessment?"
|
||||
```
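
The overall score is plain arithmetic: an unweighted mean of the five axis scores, rounded to one decimal. A minimal sketch, with `overall_score` as a hypothetical helper name:

```python
def overall_score(scores: dict[str, int]) -> float:
    """Simple (unweighted) average of the axis scores, rounded to 1 decimal."""
    if not scores:
        raise ValueError("at least one axis score is required")
    return round(sum(scores.values()) / len(scores), 1)
```

Axis scores of 5, 4, 5, 5, 4 sum to 23, giving 23 / 5 = 4.6; scores of 2, 3, 4, 2, 3 sum to 14, giving 2.8.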
|
||||
|
||||
### Step 4: Apply the Improvement
|
||||
|
||||
If any axis scored 3 or below:
|
||||
|
||||
1. State what you would do differently
|
||||
2. If the gap is fixable in < 30 seconds (missing link, unclear phrasing), fix it now
|
||||
3. If the gap requires rework, flag it explicitly: "This axis scored [score] because [evidence]. Re-running with [specific fix] would likely raise it to [target score]."
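
The triage in this step can be sketched as a small mapping from axis scores to follow-up actions. `plan_followups` is a hypothetical helper; whether a gap is fixable in under 30 seconds is a judgment the agent supplies through `quick_fix`, not something the code infers.

```python
def plan_followups(scores: dict[str, int], quick_fix: set[str]) -> dict[str, str]:
    """Map each axis scoring 3 or below to 'fix now' or 'flag for rework'.

    `quick_fix` names the axes whose gap is judged fixable in under 30 seconds.
    Axes scoring 4 or 5 are left out because this step does not apply to them.
    """
    return {
        axis: "fix now" if axis in quick_fix else "flag for rework"
        for axis, score in scores.items()
        if score <= 3
    }
```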
|
||||
|
||||
## Code Examples
|
||||
|
||||
### Example: Good Evaluation (Score 4+)
|
||||
|
||||
```
|
||||
Task: Add retry logic to HTTP client
|
||||
|
||||
Scorecard:
|
||||
Accuracy: 5 — All API calls correct. Verified: retries use
|
||||
exponential backoff. No hallucinated methods.
|
||||
Completeness: 4 — Covered happy path + 3 error cases. Missing:
|
||||
timeout handling for hung connections.
|
||||
Clarity: 5 — Code comments explain backoff formula.
|
||||
PR description links to incident that motivated this.
|
||||
Actionability:5 — Single merge. No follow-up tasks. Tests pass.
|
||||
Conciseness: 4 — 47 lines total. The retry loop could be extracted
|
||||
into a helper to drop ~8 lines.
|
||||
|
||||
Overall: 4.6 — One gap (timeout handling). Fix before merging.
|
||||
```
|
||||
|
||||
### Example: Weak Evaluation (Score 2-3)
|
||||
|
||||
```
|
||||
Task: Add retry logic to HTTP client
|
||||
|
||||
Scorecard:
|
||||
Accuracy: 2 — Used urllib3 which doesn't match our
|
||||
httpx-based codebase. Wrong library.
|
||||
Completeness: 3 — Works for GET. POST/PUT not handled (user
|
||||
said "all HTTP requests").
|
||||
Clarity: 4 — Code is readable. Good variable names.
|
||||
Actionability:2 — "Add tests" mentioned but no test file created.
|
||||
User has to write tests before merging.
|
||||
Conciseness: 3 — 120 lines. The retry config is duplicated in
|
||||
3 places instead of one shared RetryConfig object.
|
||||
|
||||
Overall: 2.8 — Wrong library used. Needs httpx rewrite.
|
||||
Fix accuracy first (switch to httpx), then extend to all
|
||||
HTTP methods, then consolidate config.
|
||||
```
|
||||
|
||||
## Anti-Patterns
|
||||
|
||||
### "Everything is a 5"
|
||||
|
||||
```
|
||||
FAIL: Accuracy: 5 — All good.
|
||||
Completeness: 5 — Everything covered.
|
||||
Clarity: 5 — Clear.
|
||||
```
|
||||
|
||||
No evidence cited. This is self-congratulation, not evaluation. A real 5 requires proving there's nothing to improve.
|
||||
|
||||
### Over-penalizing for scope creep
|
||||
|
||||
```
|
||||
FAIL: Completeness: 2 — Didn't handle WebSocket connections or
|
||||
gRPC streaming (user didn't ask for these)
|
||||
```
|
||||
|
||||
Only evaluate against what the user actually requested, not what you could have additionally built.
|
||||
|
||||
### Using the evaluation to re-litigate
|
||||
|
||||
```
|
||||
FAIL: "As I said earlier, this approach is wrong. Score: 1"
|
||||
```
|
||||
|
||||
The evaluation is about the delivered output, not about re-arguing design decisions that were already made. If the approach was wrong, that should have been caught before delivery.
|
||||
|
||||
### Mixing personal preference with objective gaps
|
||||
|
||||
```
|
||||
FAIL: "Score: 3. I don't like Python decorators."
|
||||
```
|
||||
|
||||
"Don't like" is not evidence. Cite a concrete readability, testability, or correctness concern, or leave the score at 4+.
|
||||
|
||||
## Best Practices
|
||||
|
||||
- **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took.
|
||||
- **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap.
|
||||
- **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling."
|
||||
- **Be specific about what 'fixed' looks like.** "Re-run with httpx transport configured for retries" beats "fix the library issue."
|
||||
- **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof.
|
||||
- **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?"
|
||||
|
||||
## Related Skills
|
||||
|
||||
- `agent-eval` — Head-to-head comparison of different coding agents on benchmark tasks
|
||||
- `verification-loop` — Systematic verification of outputs against expected results
|
||||
- `security-review` — Security-focused code review checklist
|
||||
87
skills/agent-self-evaluation/examples/high-score-example.md
Normal file
@ -0,0 +1,87 @@
|
||||
# High-Score Example: Adding Retry Logic
|
||||
|
||||
## Task
|
||||
> Add retry logic with exponential backoff to all HTTP requests in the API client. Max 3 retries, initial delay 1s, max delay 30s.
|
||||
|
||||
## Agent Output
|
||||
|
||||
### Summary
|
||||
Added `httpx.HTTPTransport` with retry configuration to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
|
||||
|
||||
### Files Changed
|
||||
- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.HTTPTransport(retries=3)` and exponential backoff configured via `httpx.Limits`
|
||||
- `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429)
|
||||
|
||||
### Verification
|
||||
```
|
||||
$ pytest tests/test_api_client.py -v
|
||||
42 passed in 2.34s
|
||||
|
||||
$ grep -r "RetryTransport" src/
|
||||
src/api_client.py:23: class RetryTransport(httpx.BaseTransport):
|
||||
src/api_client.py:45: self._transport = RetryTransport(...)
|
||||
|
||||
$ python -c "from src.api_client import ApiClient; c = ApiClient(); print(type(c._transport))"
|
||||
<class 'src.api_client.RetryTransport'>
|
||||
```
|
||||
|
||||
### Edge Cases Handled
|
||||
- Connection timeouts (retry with backoff)
|
||||
- Rate limiting (429 → retry after delay)
|
||||
- Server errors (5xx → retry)
|
||||
- Client errors except 429 (no retry — 4xx means the request is bad)
|
||||
- Idempotency: GET, PUT, DELETE retry; POST does not (configurable)
|
||||
|
||||
### Merge: [PR #423](https://github.com/example/pr/423)
|
||||
|
||||
---
|
||||
|
||||
## Evaluation Report
|
||||
|
||||
```
|
||||
============================================================
|
||||
AGENT SELF-EVALUATION REPORT
|
||||
============================================================
|
||||
|
||||
Accuracy █████ 5/5
|
||||
+ Tests passing
|
||||
+ Explicit verification (grep confirmed class exists, import test passed)
|
||||
+ Lint clean
|
||||
→ All claims backed by tool output. No hedging.
|
||||
|
||||
Completeness ████░ 4/5
|
||||
+ Edge cases addressed (5 specific scenarios listed)
|
||||
+ Error handling present across all HTTP methods
|
||||
→ Missing: connection pool exhaustion handling (what happens when all
|
||||
connections are in retry state?) — minor gap, not blocking.
|
||||
|
||||
Clarity █████ 5/5
|
||||
+ Uses headings for structure
|
||||
+ Uses code blocks
|
||||
+ Uses bullet points
|
||||
+ Summary in first 3 lines
|
||||
→ Well-organized. Reader can scan in 10 seconds.
|
||||
|
||||
Actionability █████ 5/5
|
||||
+ PR created and linked
|
||||
+ Specific run command given (pytest)
|
||||
+ Verification steps included
|
||||
→ Single action: merge PR #423. Everything else is done.
|
||||
|
||||
Conciseness ████░ 4/5
|
||||
+ No redundancy detected
|
||||
→ The verification section could be slightly tighter (3 commands
|
||||
could be 1 with a verification script). Minor.
|
||||
|
||||
OVERALL 4.6/5
|
||||
|
||||
TOP IMPROVEMENTS:
|
||||
No axes below 4. Strong output across all dimensions.
|
||||
```
|
||||
|
||||
### Why This Scores Well
|
||||
|
||||
1. **Accuracy pinned to tool output.** Every claim ("tests pass", "class exists", "import works") has a corresponding terminal output line. No "should work" or "probably fine."
|
||||
2. **Completeness is explicit about what's covered AND what's not.** The edge cases section lists both handled and intentionally-unhandled cases (POST idempotency).
|
||||
3. **Actionability is single-step.** The user only needs to merge one PR. No follow-up tasks, no "then configure X."
|
||||
4. **Conciseness is tight.** The output is ~250 words. The information density is high; every sentence carries weight.
|
||||
86
skills/agent-self-evaluation/examples/low-score-example.md
Normal file
@ -0,0 +1,86 @@
# Low-Score Example: Adding Retry Logic

## Task

> Add retry logic with exponential backoff to all HTTP requests in the API client. Max 3 retries, initial delay 1s, max delay 30s.

## Agent Output

Here's a retry implementation. I think this should work for most cases.

I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically.

```python
from urllib3.util import Retry

# Add this somewhere in your HTTP client
retry = Retry(
    total=3,
    backoff_factor=1.0,
    status_forcelist=[500, 502, 503, 504]
)
```

You'll want to test this before deploying. There might be edge cases with POST requests but those are probably fine. I haven't tested the timeout behavior but it should work.

Let me know if you need anything else!

---

## Evaluation Report

```
============================================================
AGENT SELF-EVALUATION REPORT
============================================================

Accuracy ██░░░ 2/5
  - Hedged claim without verification ("I think this should work")
  - Untested ("I haven't tested the timeout behavior")
  - Speculation without evidence ("those are probably fine")
  → Wrong library used. Project uses httpx, not urllib3.
    urllib3.util.Retry is incompatible with httpx.

Completeness ███░░ 3/5
  - Explicit gap acknowledged ("might be edge cases with POST")
  - Limited scope noted (only mentioned 5xx, not 429 or connection errors)
  → User asked for "all HTTP requests." Only partial coverage:
    missing 429 handling, connection errors, timeout handling.

Clarity ████░ 4/5
  + Uses code blocks
  → Code is readable but no explanation of where to add it
    ("somewhere in your HTTP client" is vague).

Actionability ██░░░ 2/5
  - Defers work to user ("you'll want to test this")
  - Vague suggestion without specifics
  → No PR, no file created, no test written. User has to:
    1. Figure out where to add the code
    2. Fix the library mismatch (httpx not urllib3)
    3. Write tests
    4. Handle POST idempotency
    5. Test timeout behavior

Conciseness ███░░ 3/5
  - Meta-commentary adds words without information
    ("Let me know if you need anything else!")
  → 120 words. Low word count but low information density.
    Half the text is hedging and disclaimers, not substance.

OVERALL 2.8/5

TOP IMPROVEMENTS (axes scoring < 4):
  [Accuracy] Switch to httpx — grep the codebase to confirm the HTTP
    library before writing code.
  [Actionability] Create a PR with the changed file + test file. Run the
    tests. End with "PR #N ready to merge."
  [Completeness] List what's covered AND what's not. If POST retry is
    unsafe, say so explicitly with reasoning.
```

### Why This Scores Poorly

1. **Accuracy fails at the most basic level**: wrong library. One `grep -r httpx src/` would have caught this. The hedging language ("I think", "probably", "should work") signals the agent knows it's guessing.
2. **Not actionable.** The user received a code snippet and a list of things they need to do. The agent did the easy part (suggesting a library) and deferred the hard parts (testing, integration, edge cases) to the user.
3. **Completeness gaps are acknowledged but not fixed.** "Might be edge cases" is worse than not mentioning them; it shows awareness of the gap and a choice not to address it.
4. **Information density is low.** 120 words, of which ~60 are hedging/disclaimers/politeness. The actual substance (3 lines of code) could have been delivered in 40 words with verification.
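
For contrast, the core of the requested behavior (max 3 retries, initial delay 1s, max delay 30s) can be sketched with the standard library alone. This is a minimal illustration, not the project's client code: `backoff_delays` and `call_with_retry` are hypothetical names, the `sleep` parameter is injected so the schedule can be checked without waiting, and retrying on any `Exception` is an assumption that a real client would narrow to connection errors, timeouts, and retryable status codes.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int = 3, initial: float = 1.0, cap: float = 30.0) -> List[float]:
    """Delay before each retry: initial * 2**n, never above cap."""
    return [min(initial * (2 ** attempt), cap) for attempt in range(max_retries)]


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 3,
    initial: float = 1.0,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn once, then retry up to max_retries times with exponential backoff."""
    delays = backoff_delays(max_retries, initial, cap)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            # Hypothetical policy: retry on any exception, re-raise after the last retry.
            if attempt == max_retries:
                raise
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

With the defaults the waits are 1s, 2s, and 4s, so the 30s cap only matters for larger retry counts. An answer built on something like this would still need to be wired into the project's actual HTTP library and backed by test output to score well on Accuracy and Actionability.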

@ -0,0 +1,71 @@
# Evaluation Criteria — Detailed Scoring Guide

This reference provides concrete scoring anchors for each axis. Use it when you're unsure whether a gap merits a 4 vs a 3, or a 2 vs a 1.

## Accuracy

| Score | Anchor | Example |
|---|---|---|
| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Configured retry via httpx transport, confirmed in httpx docs. All method names verified with grep against codebase. |
| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (claimed 0.5s, docs say 1.0s). |
| 3 | One significant factual error, or 3+ minor inaccuracies. | Used `urllib3.Retry` in an httpx codebase. Works in this one case but wrong library. |
| 2 | Multiple significant errors. Output would fail if followed. | Claimed "add this to package.json" but project uses pyproject.toml. Two other config claims also wrong. |
| 1 | Fundamentally incorrect. Output contradicts itself or known facts. | Code has syntax errors. API endpoint doesn't exist. Claims a function signature that grep disproves. |

## Completeness

| Score | Anchor | Example |
|---|---|---|
| 5 | All explicit and implicit requirements covered. Edge cases handled. Error paths addressed. | User said "add retry to all HTTP requests." GET, POST, PUT, DELETE all covered. Timeout, 429, 5xx all handled. |
| 4 | All explicit requirements covered. One implicit requirement missed. | All HTTP methods covered. Forgot to handle connection timeouts (not mentioned but expected). |
| 3 | One explicit requirement missed, or 2+ implicit gaps. | User said "add logging too." Retry logic added but no logging. |
| 2 | Multiple explicit requirements missed. Output is a partial solution. | Asked for retry + circuit breaker. Only retry implemented. |
| 1 | Misses the core request. Delivers something adjacent to what was asked. | Asked for retry logic. Wrote a health check endpoint instead. |

## Clarity

| Score | Anchor | Example |
|---|---|---|
| 5 | Perfectly structured. Jargon explained or avoided. Visual hierarchy helps scanning. No ambiguity. | README with clear sections, code blocks, and a 10-second summary at top. |
| 4 | Generally clear. One section could be better organized or one term undefined. | Good structure but `exponential backoff` used without explanation; assumes the reader knows it. |
| 3 | Understandable after re-reading. Multiple organizational issues or undefined terms. | The explanation circles the point before getting to it. Several terms used before defined. |
| 2 | Confusing in places. Reader would need to ask follow-up questions. | Code works but the PR description doesn't explain why retry was needed or what it fixes. |
| 1 | Unintelligible or contradictory. Reader cannot determine what was done or why. | Output is a wall of text with no structure. Conclusions contradict earlier statements. |

## Actionability

| Score | Anchor | Example |
|---|---|---|
| 5 | Single action required. Verification path included. No implicit steps. | "Merge this PR. Tests pass: `42 passed`. Deploy with `./deploy.sh`." |
| 4 | Single action required but verification path is implied, not explicit. | "Merge this PR." (Tests exist but weren't cited. User has to check themselves.) |
| 3 | Multiple actions required, or one action with unclear next step. | "Review and merge. Then update the config." (Which config? Where? No link or path.) |
| 2 | User must figure out how to use the output. Missing critical instructions. | Code written but no test file, no run instructions, no PR created. User has to assemble everything. |
| 1 | Output cannot be acted on without significant rework or clarification. | "Here's a design idea." (No code, no file, no PR. User has to start from scratch.) |

## Conciseness

| Score | Anchor | Example |
|---|---|---|
| 5 | Every sentence earns its place. No redundancy. Information density is high. | 30 lines that say what 60 lines would. No repeated points. No filler. |
| 4 | Minor redundancy. One paragraph could be tightened. | Good overall but repeats the motivation in both the PR description and code comments. |
| 3 | Noticeable redundancy. 20%+ of content could be removed without loss. | Explains the same concept three times (in summary, body, and conclusion). Verbose examples. |
| 2 | Significantly bloated. 40%+ of content is filler or repetition. | 200 lines for a task that needed 60. Restates the user's question. Includes irrelevant background. |
| 1 | Noise-to-signal ratio is inverted. More filler than substance. | 500-line response to a 2-line question. Most of it is boilerplate, repetition, or irrelevant context. |

## Edge Cases

### When the user gave unclear instructions

If the user's request was ambiguous, do NOT penalize completeness for not reading minds. Instead, note in the evaluation: "User's request was ambiguous about [scope]. I chose interpretation [chosen interpretation]. If they meant [alternative interpretation], this score would drop to [score]."

### When the task is inherently simple

A 3-line bug fix can legitimately score 5/5/5/5/5. The rubric scales with complexity: a simple task done perfectly IS a 5.0. Don't invent gaps to justify lower scores.

### When you caught your own error mid-task

If you made an error, caught it, and fixed it before delivering, that's a 5 on Accuracy for the final output. The evaluation is about what the user received, not your internal process. Note the self-correction as evidence of thoroughness, not as a penalty.

### When the tool output contradicts your claim

If you claimed "tests pass" but the terminal output shows a failure, that's an automatic Accuracy ≤ 2. Tool output is ground truth. Claims without verification are the most common source of low accuracy scores.
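
The tool-output rule can be sketched as a small check. This is an illustration only: `cap_accuracy_on_contradiction` is a hypothetical helper that is not part of `scripts/evaluate.py`, and the two regular expressions are assumed patterns for pytest-style output, not an exhaustive detector.

```python
import re

# Assumed patterns: a "tests pass" style claim, and a pytest-style failure summary.
CLAIM_TESTS_PASS = re.compile(r"(?i)\b(all\s+)?tests?\s+(pass|passed|passing)\b")
TOOL_FAILURE = re.compile(r"(?i)\b[1-9]\d*\s+(failed|errors?)\b")


def cap_accuracy_on_contradiction(claim_text: str, tool_output: str, accuracy: int) -> int:
    """Cap Accuracy at 2 when a 'tests pass' claim contradicts the tool output."""
    claims_pass = bool(CLAIM_TESTS_PASS.search(claim_text))
    shows_failure = bool(TOOL_FAILURE.search(tool_output))
    if claims_pass and shows_failure:
        return min(accuracy, 2)
    return accuracy
```

The cap only lowers a score; an Accuracy that is already below 2 for other reasons is left alone.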
64
skills/agent-self-evaluation/references/hook-integration.md
Normal file
@ -0,0 +1,64 @@
# Hook Integration for Session-Stop Self-Evaluation

Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session (the hook echoes a reminder; it does not run the evaluator automatically):

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "echo '[Self-Eval] Session complete. Consider running agent-self-evaluation to rate your output.'"
          }
        ],
        "description": "Remind agent to self-evaluate at session end"
      }
    ]
  }
}
```

`Stop` events do not require a `matcher` field (it is optional for `Stop`, `Notification`, `UserPromptSubmit`, and `SubagentStop` per `scripts/ci/validate-hooks.js`). If omitted, the hook object only needs `hooks` and metadata such as `description`.

## Integration with the Python Evaluator

The `scripts/evaluate.py` script can be used as a standalone tool:

```bash
# Pipe agent output directly
echo "Your agent response here" | python3 skills/agent-self-evaluation/scripts/evaluate.py

# From files
python3 skills/agent-self-evaluation/scripts/evaluate.py --task task.txt --output response.txt
```

To integrate it into hooks, capture the last agent output to a file first, then run the evaluator. For lightweight reminders after shell-based verification, use a simple supported matcher string:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "echo '[Self-Eval] If this command completed verification for a non-trivial task, consider running agent-self-evaluation.'"
          }
        ],
        "description": "Remind agent to self-evaluate after shell verification"
      }
    ]
  }
}
```

This avoids documenting unsupported command-expression matcher syntax. If your harness supports command-level matcher expressions, prefer a word-boundary regex such as `\b(pytest|npm test|go test)\b` rather than a broad `test` substring.

These hooks are opt-in. Add them to your local `hooks/hooks.json` if you want automated evaluation prompts.
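
The capture-then-evaluate step can be sketched as a small wrapper. This is a hedged illustration: `build_evaluator_command` and `run_evaluator` are hypothetical helpers, the file names are placeholders, and how your harness captures the last agent output to a file is left as an assumption. Only the `--task` and `--output` flags from the usage above are relied on.

```python
import subprocess
import sys
from typing import List, Optional

# Path taken from the usage examples above; adjust if the skill lives elsewhere.
EVALUATOR = "skills/agent-self-evaluation/scripts/evaluate.py"


def build_evaluator_command(output_path: str, task_path: Optional[str] = None) -> List[str]:
    """Build the argv for evaluating a captured agent response."""
    cmd = [sys.executable, EVALUATOR, "--output", output_path]
    if task_path is not None:
        cmd += ["--task", task_path]
    return cmd


def run_evaluator(output_path: str, task_path: Optional[str] = None) -> str:
    """Run the evaluator and return its report text (raises on a non-zero exit)."""
    result = subprocess.run(
        build_evaluator_command(output_path, task_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Keeping the command construction separate from the subprocess call makes the wrapper easy to check without the script being present.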

## Manual Usage (Recommended)

The most reliable approach is manual invocation: the agent runs self-evaluation as part of its workflow when the `agent-self-evaluation` skill is active, without requiring hook configuration. The skill's "When to Activate" section already covers trigger conditions (multi-file changes, debugging sessions, design documents).

408
skills/agent-self-evaluation/scripts/evaluate.py
Executable file
@ -0,0 +1,408 @@
#!/usr/bin/env python3
"""Standalone agent output evaluator using the 5-axis rubric.

Reads a task description and agent output from stdin or files,
scores each axis, and prints a structured evaluation report.

Usage:
    # Pipe agent output directly (stdin is only read when --output is absent)
    echo "Your agent response here" | evaluate.py --task task.txt

    # From files
    evaluate.py --task task.txt --output response.txt

    # Interactive (reads task from prompt, output from stdin)
    evaluate.py --interactive

The evaluator uses keyword heuristics + structural checks as a first pass.
For production use, pair with an LLM judge for semantic understanding.
"""

import argparse
import re
import sys
from dataclasses import dataclass, field
from typing import Optional

# Tunable thresholds for evaluation heuristics
WALL_OF_TEXT_WORDS = 200
SUMMARY_CHECK_WORDS = 300
SUMMARY_CHECK_FIRST_N = 100
TASK_OUTPUT_RATIO_HIGH = 15
TASK_OUTPUT_RATIO_MEDIUM = 8


@dataclass
class AxisScore:
    name: str
    score: int
    evidence: list[str] = field(default_factory=list)
    improvement: Optional[str] = None


def count_words(text: str) -> int:
    return len(text.split())


def check_accuracy(text: str) -> AxisScore:
    """Check for verifiable claims, tool output references, error signs."""
    evidence = []
    deductions = 0
    score = 5

    # Positive signals: verified claims
    verified_patterns = [
        (r"(?i)(tests?\s+pass|all\s+tests?\s+passing|\d+\s+passed)", "Tests passing"),
        (r"(?i)(exit\s+code\s*[:=]?\s*0|exited\s+with\s+0)", "Clean exit code"),
        (r"(?i)(lint.*clean|no\s+lint\s+errors|0\s+errors)", "Lint clean"),
        (r"(?i)(verified|confirmed|validated)\s+(with|against|using|by)", "Explicit verification"),
        (r"(?i)(grep|rg)\s+.*\b(found|matched|returned)", "Grep confirmed"),
    ]
    for pattern, label in verified_patterns:
        if re.search(pattern, text):
            evidence.append(f"+ {label}")

    # Negative signals: unverified claims
    danger_patterns = [
        (r"(?i)(should\s+work|probably\s+fine|should\s+be\s+ok)", "Hedged claim without verification"),
        (r"(?i)(I\s+think|I\s+believe|I\s+assume|might\s+be)", "Speculation without evidence"),
        (r"(?i)(untested|not\s+tested|haven'?t\s+tested)", "Explicitly untested"),
        (r"(?i)(TODO|FIXME|HACK|WORKAROUND)", "Unresolved TODO/FIXME"),
    ]
    for pattern, label in danger_patterns:
        if re.search(pattern, text):
            deductions += 1
            evidence.append(f"- {label}")

    if deductions >= 3:
        score = 2
    elif deductions == 2:
        score = 3
    elif deductions == 1:
        score = 4

    if not evidence:
        evidence.append("No verification signals detected — score assumes correctness")

    result = AxisScore(name="Accuracy", score=score, evidence=evidence)
    if score < 5:
        result.improvement = "Cite specific tool outputs (test results, exit codes, grep findings) to back claims"
    return result


def check_completeness(text: str) -> AxisScore:
    """Check for requirement coverage, edge cases, error handling."""
    evidence = []
    score = 5

    # Positive signals
    completeness_signals = [
        (r"(?i)(edge\s*cases?|corner\s*cases?)", "Edge cases addressed"),
        (r"(?i)(error\s*handling|exception\s*handling|try/except|try\s*{)", "Error handling present"),
        (r"(?i)(all\s+\w+\s+(methods|endpoints|routes))", "Full coverage claimed"),
        (r"(?i)(verification|verified\s+that|confirmed\s+that)", "Verification step present"),
    ]
    for pattern, label in completeness_signals:
        if re.search(pattern, text):
            evidence.append(f"+ {label}")

    # Gaps
    gap_signals = [
        (r"(?i)(not\s+covered|not\s+handled|out\s+of\s+scope)", "Explicit gap acknowledged"),
        (r"(?i)(only\s+(works|handles|supports)\s+\w+)", "Limited scope noted"),
        (r"(?i)(assume[sd]?\s+that|assuming\s+the)", "Assumption without verification"),
    ]
    deductions = 0
    for pattern, label in gap_signals:
        if re.search(pattern, text):
            deductions += 1
            evidence.append(f"- {label}")

    if deductions >= 2:
        score = 3
    elif deductions == 1:
        score = 4

    if not evidence:
        evidence.append("No completeness signals — unable to assess coverage")

    result = AxisScore(name="Completeness", score=score, evidence=evidence)
    if score < 5:
        result.improvement = "List what was covered AND what was intentionally excluded, with reasoning"
    return result


def _check_jargon(text: str) -> tuple[int, list[str]]:
    """Return clarity deductions for unexplained domain jargon."""
    jargon = [
        (r"\b(idempotent|race condition|deadlock|thundering herd)\b", "concurrency"),
        (r"\b(exponential backoff|circuit breaker|bulkhead)\b", "resilience"),
        (r"\b(ACID|CAP|eventual consistency|linearizability)\b", "database theory"),
    ]
    explanation_pattern = r"(?i)({domain}|means|refers to|i\.e\.|in other words)"
    for pattern, domain in jargon:
        has_term = re.search(pattern, text, re.IGNORECASE)
        explains_term = re.search(explanation_pattern.format(domain=domain), text)
        if has_term and not explains_term:
            return 1, [f"- Domain term used without explanation ({domain})"]
    return 0, []


def _check_summary(text: str) -> tuple[int, list[str]]:
    """Return clarity deduction when long output lacks an early summary."""
    summary_terms = ["summary", "tldr", "overview", "in short"]
    has_early_summary = any(term in ' '.join(text.split()[:SUMMARY_CHECK_FIRST_N]).lower() for term in summary_terms)
    if not has_early_summary and count_words(text) > SUMMARY_CHECK_WORDS:
        return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"]
    return 0, []


def check_clarity(text: str) -> AxisScore:
    """Check for structure, readability, jargon handling."""
    evidence = []
    deductions = 0

    if re.search(r"^#{1,3}\s+", text, re.MULTILINE):
        evidence.append("+ Uses headings for structure")
    if re.search(r"```", text):
        evidence.append("+ Uses code blocks")
    if re.search(r"^\s*[-*]\s+", text, re.MULTILINE):
        evidence.append("+ Uses bullet points")

    for paragraph in [p for p in text.split("\n\n") if p.strip()]:
        if count_words(paragraph) > WALL_OF_TEXT_WORDS:
            deductions += 1
            evidence.append("- Wall-of-text paragraph (>200 words without break)")
            break

    jargon_deductions, jargon_evidence = _check_jargon(text)
    summary_deductions, summary_evidence = _check_summary(text)
    deductions += jargon_deductions + summary_deductions
    evidence.extend(jargon_evidence + summary_evidence)

    if deductions >= 3:
        score = 2
    elif deductions == 2:
        score = 3
    elif deductions == 1:
        score = 4
    else:
        score = 5

    if not evidence:
        evidence.append("+ Well-structured with no clarity issues detected")

    result = AxisScore(name="Clarity", score=score, evidence=evidence)
    if score < 5:
        result.improvement = "Add headings, break long paragraphs, define domain terms on first use"
    return result


def check_actionability(text: str) -> AxisScore:
    """Check if the user can act on the output immediately."""
    evidence = []
    score = 5
    deductions = 0

    # Positive signals
    actionable_signals = [
        (r"(?i)(merge|PR|pull request).*?(created|ready|open)", "PR created"),
        (r"(?i)(run|execute)\s+[`\"']?[\w./-]+", "Specific run command given"),
        (r"(?i)(next\s+steps?|follow[- ]up|what\s+to\s+do)", "Next steps provided"),
        (r"(?i)(file\s+(created|written|modified|updated)\s+at)", "File path specified"),
    ]
    for pattern, label in actionable_signals:
        if re.search(pattern, text):
            evidence.append(f"+ {label}")

    # Negative signals
    vague_signals = [
        (r"(?i)(you\s+(should|could|might\s+want\s+to))\s+\w+", "Vague suggestion without specifics"),
        (r"(?i)(consider|maybe|perhaps)\s+\w+ing", "Non-committal suggestion"),
        (r"(?i)(figure\s+out|look\s+into|investigate)\s", "Defers work to user"),
    ]
    for pattern, label in vague_signals:
        if re.search(pattern, text):
            deductions += 1
            evidence.append(f"- {label}")

    if deductions >= 3:
        score = 2
    elif deductions == 2:
        score = 3
    elif deductions == 1:
        score = 4

    if not evidence:
        evidence.append("No actionability signals — user may need to ask 'what now?'")

    result = AxisScore(name="Actionability", score=score, evidence=evidence)
    if score < 5:
        result.improvement = "End with a single clear action: 'Merge this PR', 'Run ./deploy.sh', or 'Review the 3 changed files'"
    return result


def check_conciseness(text: str, task: Optional[str] = None) -> AxisScore:
    """Check for redundancy, filler, information density."""
    evidence = []
    score = 5
    wc = count_words(text)

    # Heuristic: task-to-output ratio
    if task:
        task_wc = count_words(task)
        ratio = wc / max(task_wc, 1)
        if ratio > TASK_OUTPUT_RATIO_HIGH:
            evidence.append(f"- Output is {ratio:.0f}x longer than task description (high ratio)")
            score = min(score, 3)
        elif ratio > TASK_OUTPUT_RATIO_MEDIUM:
            evidence.append(f"- Output is {ratio:.0f}x longer than task description")
            score = min(score, 4)

    # Redundancy signals
    redundancy_checks = [
        (r"(?i)(as\s+(I|we)\s+(mentioned|said|noted|discussed)\s+(earlier|above|before))",
         "Refers back to earlier statement (possible repetition)"),
        (r"(?i)(to\s+summarize|in\s+summary|in\s+conclusion|to\s+conclude)",
         "Has explicit summary (good if needed, flag if redundant)"),
        (r"(?i)(let\s+me\s+(explain|break\s+this\s+down|walk\s+you\s+through))",
         "Meta-commentary adds words without information"),
    ]
    redundant_count = 0
    for pattern, label in redundancy_checks:
        matches = re.findall(pattern, text)
        if len(matches) > 2:
            redundant_count += 1
            evidence.append(f"- '{label}' appears {len(matches)} times")

    if redundant_count >= 2:
        score = min(score, 3)
    elif redundant_count == 1:
        score = min(score, 4)

    if not evidence and score == 5:
        evidence.append("+ No redundancy detected. Information density appears good.")

    result = AxisScore(name="Conciseness", score=score, evidence=evidence)
    if score < 5:
        result.improvement = "Cut meta-commentary, remove repeated points, trim examples to one representative case"
    return result


def evaluate(task: Optional[str], output: str) -> list[AxisScore]:
    """Run all 5 axis checks and return scored results."""
    return [
        check_accuracy(output),
        check_completeness(output),
        check_clarity(output),
        check_actionability(output),
        check_conciseness(output, task),
    ]


def format_report(scores: list[AxisScore]) -> str:
    """Format scores into a readable evaluation report."""
    avg = sum(s.score for s in scores) / len(scores)
    lines = []
    lines.append("=" * 60)
    lines.append("AGENT SELF-EVALUATION REPORT")
    lines.append("=" * 60)
    lines.append(f"Summary: Overall score {avg:.1f}/5 across 5 quality axes.")
    lines.append("")

    for s in scores:
        bar = "█" * s.score + "░" * (5 - s.score)
        lines.append(f" {s.name:<15} {bar} {s.score}/5")
        lines.extend(f" {e}" for e in s.evidence)
        if s.improvement:
            lines.append(f" → {s.improvement}")
        lines.append("")

    lines.append(f" {'OVERALL':<15} {avg:.1f}/5")
    lines.append("")

    # Critical issues (axes ≤ 2)
    critical = [(s, s.improvement or "No improvement suggested") for s in scores if s.score <= 2]
    lines.append("CRITICAL ISSUES (axes ≤ 2):")
    if critical:
        for s, imp in critical:
            lines.append(f" [{s.name}] Score {s.score}/5 — {imp}")
    else:
        lines.append(" None")

    lines.append("")
    lines.append("Self-check: Would the user agree with this assessment? [Yes/No + brief justification]")
    lines.append("")

    # Top improvements (axes scoring < 4, ranked by impact)
    improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4]
    lines.append("TOP IMPROVEMENTS:")
    if improvements:
        for i, (s, imp) in enumerate(sorted(improvements, key=lambda x: x[0].score), 1):
            lines.append(f" {i}. [{s.name}] {imp}")
    else:
        lines.append(" No axes below 4. Strong output across all dimensions.")

    lines.append("")

    # Verdict
    min_score = min(s.score for s in scores)
    if min_score <= 2:
        verdict = f"Redo with specific fixes. Weakest axis: {min(scores, key=lambda s: s.score).name} ({min_score}/5)."
    elif any(s.score <= 3 for s in scores):
        weak = [s.name for s in scores if s.score <= 3]
        verdict = f"Fix {'/'.join(weak)} issues, then deliver."
    elif avg >= 4.5:
        verdict = "Deliver as-is. No changes needed."
    else:
        verdict = "Deliver as-is. Minor improvements noted above."
    lines.append(f"VERDICT: {verdict}")

    return "\n".join(lines)


def _read_file_or_text(path: Optional[str], *, required: bool = False) -> Optional[str]:
    """Read a file path or return inline text when allowed."""
    if path is None:
        return None
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        if required:
            print(f"Error: output file '{path}' not found", file=sys.stderr)
            sys.exit(1)
        return path


def _read_input(args: argparse.Namespace) -> tuple[Optional[str], str]:
    """Read task and output for interactive, file, or pipe mode."""
    if args.interactive:
        task = input("Task description: ").strip()
        print("Paste agent output (Ctrl+D to finish):")
        return task, sys.stdin.read()
    if args.output:
        return _read_file_or_text(args.task), _read_file_or_text(args.output, required=True) or ""
    return _read_file_or_text(args.task), sys.stdin.read()


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Evaluate agent output against the 5-axis rubric"
    )
    parser.add_argument("--task", help="Task description (file path or inline text)")
    parser.add_argument("--output", help="Agent output to evaluate (file path)")
    parser.add_argument("--interactive", action="store_true", help="Prompt for task and read output from stdin")
    args = parser.parse_args()

    task, output = _read_input(args)
    if not output:
        print("Error: no output to evaluate", file=sys.stderr)
        sys.exit(1)

    scores = evaluate(task, output)
    print(format_report(scores))


if __name__ == "__main__":
    main()

86
skills/agent-self-evaluation/templates/evaluation-report.md
Normal file
@ -0,0 +1,86 @@
# Agent Self-Evaluation Report Template

Copy this template and fill it in after completing a task. The format matches `scripts/evaluate.py` output.

```
============================================================
AGENT SELF-EVALUATION REPORT
============================================================
Summary: Overall score X.X/5 across 5 quality axes.

Accuracy █████ 5/5 or ███░░ 3/5
  + [Evidence: passing tests, verified claims]
  - [Gaps: unverified claims, hedging language]
  → [Improvement if score < 5]

Completeness █████ 5/5
  + [What's covered: all requirements + edge cases]
  - [What's missing: explicitly acknowledge gaps]
  → [Improvement if score < 5]

Clarity █████ 5/5
  + [Structure: headings, code blocks, bullet points]
  - [Issues: undefined terms, wall of text, no summary]
  → [Improvement if score < 5]

Actionability █████ 5/5
  + [User can: merge PR, run command, review file]
  - [Blockers: missing steps, vague suggestions]
  → [Improvement if score < 5]

Conciseness █████ 5/5
  + [Tight: no repetition, high information density]
  - [Bloat: filler, meta-commentary, repeated points]
  → [Improvement if score < 5]

OVERALL X.X/5

CRITICAL ISSUES (axes ≤ 2):
  [Axis] Score N/5 — specific fix needed
  (or "None" if no axis ≤ 2)

Self-check: Would the user agree with this assessment? [Yes/No + brief justification]

TOP IMPROVEMENTS:
  1. [Highest impact fix]
  2. [Second highest]
  (Only list axes scoring < 4, ranked by user impact)

VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
```

## Quick Reference: Scoring Triggers

| If you see this... | Accuracy | Completeness | Clarity | Actionability | Conciseness |
|---|---|---|---|---|---|
| "should work" / "probably fine" | ≤4 | — | — | — | — |
| "I think" / "I believe" | ≤4 | — | — | — | — |
| No test output cited | ≤4 | — | — | — | — |
| "TODO" / "FIXME" left behind | ≤3 | ≤3 | — | ≤3 | — |
| Missing error handling | — | ≤3 | — | — | — |
| Only happy path covered | — | ≤3 | — | — | — |
| Wall-of-text paragraph (>200 words) | — | — | ≤3 | — | — |
| No headings or structure | — | — | ≤3 | — | — |
| "You should..." without specifics | — | — | — | ≤3 | — |
| No PR or file created | — | — | — | ≤3 | — |
| User needs to figure out next step | — | — | — | ≤2 | — |
| Repeated points (3+ times) | — | — | — | — | ≤3 |
| "Let me explain..." / "To summarize..." x3+ | — | — | — | — | ≤3 |
| Output >15x longer than task | — | — | — | — | ≤3 |
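
One way to read the table is as a set of per-axis caps. The sketch below is an interpretation, not part of `scripts/evaluate.py`: the trigger keys are hypothetical names for a few of the rows, and combining several triggers by taking the lowest cap is an assumption.

```python
from typing import Dict, Iterable

AXES = ("Accuracy", "Completeness", "Clarity", "Actionability", "Conciseness")

# A few rows from the table, keyed by hypothetical trigger names.
CAPS: Dict[str, Dict[str, int]] = {
    "hedged_claim": {"Accuracy": 4},
    "todo_left_behind": {"Accuracy": 3, "Completeness": 3, "Actionability": 3},
    "wall_of_text": {"Clarity": 3},
    "user_must_find_next_step": {"Actionability": 2},
    "output_over_15x_task": {"Conciseness": 3},
}


def apply_caps(triggers: Iterable[str]) -> Dict[str, int]:
    """Start every axis at 5 and lower it to the strictest cap that applies."""
    scores = {axis: 5 for axis in AXES}
    for trigger in triggers:
        for axis, cap in CAPS[trigger].items():
            scores[axis] = min(scores[axis], cap)
    return scores
```

For example, a leftover TODO combined with an unclear next step would cap Actionability at 2 under this reading, because the stricter of the two caps wins.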

## When to Skip

Skip the evaluation if:
- Task was a single tool call (e.g., "read this file": nothing to evaluate)
- User explicitly says "don't evaluate" or "just do it"
- Task is purely conversational (greeting, small talk)
- You're mid-workflow and the user will judge the final output, not intermediate steps

## Post-Evaluation Actions

| Overall Score | What to do |
|---|---|
| ≥4.5 | Deliver as-is. No changes needed. |
| 3.5–4.4 | Flag top improvement but deliver. Fix if <30 seconds. |
| 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" |
| <2.5 | Don't deliver. Say: "This scored [score] because [evidence]. Let me redo this with [specific fix]." Then redo. |
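
The bands in the table above can be expressed as a small lookup. This is a minimal sketch: `overall_score` and `post_evaluation_action` are hypothetical helper names, the returned labels are shorthand for the table rows, and treating each band as "at or above its lower bound" is an assumption for scores that fall between the printed ranges.

```python
from typing import List


def overall_score(axis_scores: List[int]) -> float:
    """Overall score is the plain mean of the five axis scores."""
    return sum(axis_scores) / len(axis_scores)


def post_evaluation_action(overall: float) -> str:
    """Map an overall score to the action band from the table."""
    if overall >= 4.5:
        return "deliver"
    if overall >= 3.5:
        return "deliver and flag top improvement"
    if overall >= 2.5:
        return "ask user whether to redo"
    return "redo"
```

Applied to the low-score example's axis scores (2, 3, 4, 2, 3), the mean is 2.8, which lands in the band where the agent states what it would change and asks the user before redoing.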