Hawthorn 149be89d39 fix: address final lint blockers for agent self-evaluation
- Replace U+274C cross-mark examples with ASCII FAIL: prefixes
- Ensure agent-evaluator markdown ends with trailing newline
- Replace markdown placeholder underscores with bracketed placeholders to satisfy markdownlint MD037
2026-06-11 17:58:57 +05:30

183 lines
7.4 KiB
Markdown

---
name: agent-self-evaluation
description: Use after completing any non-trivial task. The agent self-rates its output on 5 axes — accuracy, completeness, clarity, actionability, conciseness — with concrete evidence per criterion. Produces a structured 1-5 scorecard with specific improvement suggestions.
origin: ECC
---
# Agent Self-Evaluation
After completing a complex task, the agent pauses to rate its own output against a structured 5-axis rubric. This is NOT a pass/fail gate — it's a deliberate reflection step that catches omissions, flags overconfidence, and surface areas for improvement before the user has to.
## When to Activate
- After writing code that spans 3+ files or 50+ lines
- After completing a multi-step workflow (implement → test → review)
- After a debugging session that involved 3+ attempts
- After producing a design document, architecture decision, or written analysis
- When the user asks "how good was that?" or "rate yourself"
- At the end of any session Stop hook (if configured — see `references/hook-integration.md`)
## Core Concepts
### The 5 Evaluation Axes
| Axis | Question | What it catches |
|---|---|---|
| **Accuracy** | Are the facts, claims, and outputs correct? | Hallucinations, wrong API names, incorrect syntax, false statements |
| **Completeness** | Did it cover everything the user asked for? | Missed edge cases, unhandled error paths, forgotten requirements, skipped subtasks |
| **Clarity** | Is the explanation understandable and well-structured? | Confusing explanations, jargon without definition, missing context, rambling |
| **Actionability** | Can the user act on the output immediately? | Vague suggestions, missing steps, "you should X" without showing how, no verification path |
| **Conciseness** | Did it use the minimum words/tokens needed? | Redundancy, over-explanation, repeating the user's question verbatim, filler content |
### Scoring Scale
```
5 — Exceptional: no reasonable improvement possible
4 — Good: minor nits only, no substantive gaps
3 — Adequate: meets the request but has a notable weakness on at least one axis
2 — Weak: has a clear gap that affects usability or correctness
1 — Poor: fundamentally misses the request or contains significant errors
```
### The Evidence Rule
Every score below 5 MUST cite specific evidence. A score of 3 cannot just say "could be better" — it must say exactly what is missing or wrong. The mantra: **"Show the gap, don't just name it."**
## Workflow
### Step 1: Collect the Raw Material
Gather what you'll evaluate:
```
- The original user request (read back from conversation)
- Your final response/output (the deliverable)
- Any tool outputs that verify correctness (test results, exit codes, lint output)
- Any user feedback received during the task (corrections, "try again", "that's not right")
```
### Step 2: Score Each Axis Independently
Work through the 5 axes one at a time. For each:
1. Read the axis question
2. Find evidence (or lack of evidence) in the output
3. Assign a score 1-5
4. If score < 5, write a one-sentence improvement note citing the gap
Do NOT average the scores in your head first and then work backwards. Score each axis fresh.
### Step 3: Produce the Evaluation Report
Use the template from `templates/evaluation-report.md`. The report must include:
```
- One-line summary
- 5-axis scorecard (score + evidence per axis)
- Overall score (simple average, rounded to 1 decimal)
- 1-3 specific improvements ranked by impact
- Self-check: "Would the user agree with this assessment?"
```
### Step 4: Apply the Improvement
If any axis scored 3 or below:
1. State what you would do differently
2. If the gap is fixable in < 30 seconds (missing link, unclear phrasing), fix it now
3. If the gap requires rework, flag it explicitly: "This axis scored [reason] because [evidence]. Re-running with [specific fix] would likely raise it to [score]."
## Code Examples
### Example: Good Evaluation (Score 4+)
```
Task: Add retry logic to HTTP client
Scorecard:
Accuracy: 5 — All API calls correct. Verified: retries use
exponential backoff. No hallucinated methods.
Completeness: 4 — Covered happy path + 3 error cases. Missing:
timeout handling for hung connections.
Clarity: 5 — Code comments explain backoff formula.
PR description links to incident that motivated this.
Actionability:5 — Single merge. No follow-up tasks. Tests pass.
Conciseness: 4 — 47 lines total. The retry loop could be extracted
into a helper to drop ~8 lines.
Overall: 4.6 — One gap (timeout handling). Fix before merging.
```
### Example: Weak Evaluation (Score 2-3)
```
Task: Add retry logic to HTTP client
Scorecard:
Accuracy: 2 — Used urllib3 which doesn't match our
httpx-based codebase. Wrong library.
Completeness: 3 — Works for GET. POST/PUT not handled (user
said "all HTTP requests").
Clarity: 4 — Code is readable. Good variable names.
Actionability:2 — "Add tests" mentioned but no test file created.
User has to write tests before merging.
Conciseness: 3 — 120 lines. The retry config is duplicated in
3 places instead of one shared RetryConfig object.
Overall: 2.8 — Wrong library used. Needs httpx rewrite.
Fix accuracy first (switch to httpx), then extend to all
HTTP methods, then consolidate config.
```
## Anti-Patterns
### "Everything is a 5"
```
FAIL: Accuracy: 5 — All good.
Completeness: 5 — Everything covered.
Clarity: 5 — Clear.
```
No evidence cited. This is self-congratulation, not evaluation. A real 5 requires proving there's nothing to improve.
### Over-penalizing for scope creep
```
FAIL: Completeness: 2 — Didn't handle WebSocket connections or
gRPC streaming (user didn't ask for these)
```
Only evaluate against what the user actually requested, not what you could have additionally built.
### Using the evaluation to re-litigate
```
FAIL: "As I said earlier, this approach is wrong. Score: 1"
```
The evaluation is about the delivered output, not about re-arguing design decisions that were already made. If the approach was wrong, that should have been caught before delivery.
### Mixing personal preference with objective gaps
```
FAIL: "Score: 3. I don't like Python decorators."
```
"Don't like" is not evidence. Cite a concrete readability, testability, or correctness concern, or leave the score at 4+.
## Best Practices
- **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took.
- **One improvement per weak axis.** Don't list 5 things for one axis pick the highest-impact gap.
- **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling."
- **Be specific about what 'fixed' looks like.** "Re-run with httpx transport configured for retries" beats "fix the library issue."
- **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess grep for the proof.
- **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?"
## Related Skills
- `agent-eval` Head-to-head comparison of different coding agents on benchmark tasks
- `verification-loop` Systematic verification of outputs against expected results
- `security-review` Security-focused code review checklist