---
name: agent-self-evaluation
description: Use after completing any non-trivial task. The agent self-rates its output on 5 axes — accuracy, completeness, clarity, actionability, conciseness — with concrete evidence per criterion. Produces a structured 1-5 scorecard with specific improvement suggestions.
origin: ECC
---

# Agent Self-Evaluation

After completing a complex task, the agent pauses to rate its own output against a structured 5-axis rubric. This is NOT a pass/fail gate. It is a deliberate reflection step that catches omissions, flags overconfidence, and surfaces areas for improvement before the user has to find them.

## When to Activate

- After writing code that spans 3+ files or 50+ lines
- After completing a multi-step workflow (implement → test → review)
- After a debugging session that involved 3+ attempts
- After producing a design document, architecture decision, or written analysis
- When the user asks "how good was that?" or "rate yourself"
- At the end of any session, from a Stop hook (if configured; see `references/hook-integration.md`)

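The triggers above can be sketched as a single predicate. This is a minimal illustration, not part of the skill itself: `TaskStats`, its field names, and `should_self_evaluate` are hypothetical, and treating "multi-step" as "more than one step" is an assumption. The Stop hook trigger is left out because it depends on hook configuration.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    """Hypothetical summary of a finished task; field names are illustrative."""
    files_changed: int = 0
    lines_written: int = 0
    workflow_steps: int = 0          # e.g. implement, test, review = 3
    debug_attempts: int = 0
    produced_document: bool = False  # design doc, architecture decision, analysis
    user_requested: bool = False     # "how good was that?" / "rate yourself"


def should_self_evaluate(stats: TaskStats) -> bool:
    """Return True when any trigger from the list above applies."""
    return (
        stats.files_changed >= 3
        or stats.lines_written >= 50
        or stats.workflow_steps > 1   # assumption: "multi-step" means 2+ steps
        or stats.debug_attempts >= 3
        or stats.produced_document
        or stats.user_requested
    )
```

Any one trigger is sufficient, so the checks are combined with `or` rather than weighted.
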
## Core Concepts

### The 5 Evaluation Axes

| Axis | Question | What it catches |
|---|---|---|
| **Accuracy** | Are the facts, claims, and outputs correct? | Hallucinations, wrong API names, incorrect syntax, false statements |
| **Completeness** | Did it cover everything the user asked for? | Missed edge cases, unhandled error paths, forgotten requirements, skipped subtasks |
| **Clarity** | Is the explanation understandable and well-structured? | Confusing explanations, jargon without definition, missing context, rambling |
| **Actionability** | Can the user act on the output immediately? | Vague suggestions, missing steps, "you should X" without showing how, no verification path |
| **Conciseness** | Did it use the minimum words/tokens needed? | Redundancy, over-explanation, repeating the user's question verbatim, filler content |

### Scoring Scale

```
5 — Exceptional: no reasonable improvement possible
4 — Good: minor nits only, no substantive gaps
3 — Adequate: meets the request but has a notable weakness on at least one axis
2 — Weak: has a clear gap that affects usability or correctness
1 — Poor: fundamentally misses the request or contains significant errors
```

### The Evidence Rule

Every score below 5 MUST cite specific evidence. A score of 3 cannot just say "could be better" — it must say exactly what is missing or wrong. The mantra: **"Show the gap, don't just name it."**

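One way to make the Evidence Rule hard to skip is to enforce it when a score is recorded. The sketch below is illustrative only: `AxisScore` and `AXES` are hypothetical names, and it checks only that evidence is present, not that it is specific.

```python
from dataclasses import dataclass

# The five axes from the table above (lowercased; hypothetical constant).
AXES = ("accuracy", "completeness", "clarity", "actionability", "conciseness")


@dataclass(frozen=True)
class AxisScore:
    """One scorecard row; rejects scores that break the Evidence Rule."""
    axis: str
    score: int
    evidence: str = ""

    def __post_init__(self) -> None:
        if self.axis not in AXES:
            raise ValueError(f"unknown axis: {self.axis!r}")
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")
        # Evidence Rule: every score below 5 must cite specific evidence.
        if self.score < 5 and not self.evidence.strip():
            raise ValueError(f"{self.axis}: a score below 5 needs evidence")
```

For example, `AxisScore("completeness", 3)` raises `ValueError`, while `AxisScore("completeness", 3, "POST/PUT not handled")` is accepted.
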
## Workflow

### Step 1: Collect the Raw Material

Gather what you'll evaluate:

```
- The original user request (read back from conversation)
- Your final response/output (the deliverable)
- Any tool outputs that verify correctness (test results, exit codes, lint output)
- Any user feedback received during the task (corrections, "try again", "that's not right")
```

### Step 2: Score Each Axis Independently

Work through the 5 axes one at a time. For each:

1. Read the axis question
2. Find evidence (or lack of evidence) in the output
3. Assign a score 1-5
4. If score < 5, write a one-sentence improvement note citing the gap

Do NOT average the scores in your head first and then work backwards. Score each axis fresh.

### Step 3: Produce the Evaluation Report

Use the template from `templates/evaluation-report.md`. The report must include:

```
- One-line summary
- 5-axis scorecard (score + evidence per axis)
- Overall score (simple average, rounded to 1 decimal)
- 1-3 specific improvements ranked by impact
- Self-check: "Would the user agree with this assessment?"
```

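The overall score is plain arithmetic, sketched below. `overall_score` and `AXES` are hypothetical names, not part of the template. The two sample scorecards use the axis scores from the Code Examples section.

```python
from statistics import mean
from typing import Mapping

AXES = ("Accuracy", "Completeness", "Clarity", "Actionability", "Conciseness")


def overall_score(scores: Mapping[str, int]) -> float:
    """Simple average of the five axis scores, rounded to 1 decimal."""
    missing = [axis for axis in AXES if axis not in scores]
    if missing:
        raise ValueError(f"missing axes: {missing}")
    return round(mean(scores[axis] for axis in AXES), 1)


good = {"Accuracy": 5, "Completeness": 4, "Clarity": 5,
        "Actionability": 5, "Conciseness": 4}
weak = {"Accuracy": 2, "Completeness": 3, "Clarity": 4,
        "Actionability": 2, "Conciseness": 3}

print(overall_score(good))  # 4.6
print(overall_score(weak))  # 2.8
```

Requiring all five axes before averaging keeps a skipped axis from silently inflating the result.
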
### Step 4: Apply the Improvement

If any axis scored 3 or below:

1. State what you would do differently
2. If the gap is fixable in < 30 seconds (missing link, unclear phrasing), fix it now
3. If the gap requires rework, flag it explicitly: "This axis scored [score] because [evidence]. Re-running with [specific fix] would likely raise it to [target score]."

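The triage in this step can be sketched as a small function. `plan_improvements` and the `quick_fixable` set are hypothetical: deciding whether a gap is fixable in under 30 seconds remains a judgment call that the caller supplies.

```python
from typing import Dict, List, Mapping, Set


def plan_improvements(
    scores: Mapping[str, int], quick_fixable: Set[str]
) -> Dict[str, List[str]]:
    """Split axes that scored 3 or below into fix-now and flag-for-rework."""
    plan: Dict[str, List[str]] = {"fix_now": [], "flag_for_rework": []}
    for axis, score in scores.items():
        if score > 3:
            continue  # only axes at 3 or below trigger this step
        bucket = "fix_now" if axis in quick_fixable else "flag_for_rework"
        plan[bucket].append(axis)
    return plan


scores = {"Accuracy": 2, "Completeness": 3, "Clarity": 4,
          "Actionability": 2, "Conciseness": 3}
# Assumption for illustration: only the Conciseness gap is a quick fix.
plan = plan_improvements(scores, quick_fixable={"Conciseness"})
print(plan["fix_now"])          # ['Conciseness']
print(plan["flag_for_rework"])  # ['Accuracy', 'Completeness', 'Actionability']
```
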
## Code Examples

### Example: Good Evaluation (Score 4+)

```
Task: Add retry logic to HTTP client

Scorecard:
  Accuracy:      5 — All API calls correct. Verified: retries use
                     exponential backoff. No hallucinated methods.
  Completeness:  4 — Covered happy path + 3 error cases. Missing:
                     timeout handling for hung connections.
  Clarity:       5 — Code comments explain backoff formula.
                     PR description links to incident that motivated this.
  Actionability: 5 — Single merge. No follow-up tasks. Tests pass.
  Conciseness:   4 — 47 lines total. The retry loop could be extracted
                     into a helper to drop ~8 lines.

Overall: 4.6 — One gap (timeout handling). Fix before merging.
```

### Example: Weak Evaluation (Score 2-3)

```
Task: Add retry logic to HTTP client

Scorecard:
  Accuracy:      2 — Used urllib3 which doesn't match our
                     httpx-based codebase. Wrong library.
  Completeness:  3 — Works for GET. POST/PUT not handled (user
                     said "all HTTP requests").
  Clarity:       4 — Code is readable. Good variable names.
  Actionability: 2 — "Add tests" mentioned but no test file created.
                     User has to write tests before merging.
  Conciseness:   3 — 120 lines. The retry config is duplicated in
                     3 places instead of one shared RetryConfig object.

Overall: 2.8 — Wrong library used. Needs httpx rewrite.
  Fix accuracy first (switch to httpx), then extend to all
  HTTP methods, then consolidate config.
```

## Anti-Patterns

### "Everything is a 5"

```
FAIL: Accuracy: 5 — All good.
      Completeness: 5 — Everything covered.
      Clarity: 5 — Clear.
```

No evidence cited. This is self-congratulation, not evaluation. A real 5 requires proving there's nothing to improve.

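A crude automated check for this anti-pattern might look like the sketch below. It is a heuristic only: `looks_like_rubber_stamp` is a hypothetical name, and the five-word threshold is an arbitrary assumption, not something the rubric prescribes. It flags scorecards for a second look; it cannot judge whether the evidence is actually sound.

```python
from typing import Mapping, Tuple


def looks_like_rubber_stamp(
    scorecard: Mapping[str, Tuple[int, str]], min_evidence_words: int = 5
) -> bool:
    """Flag a scorecard where every axis is a 5 and some evidence is thin.

    `scorecard` maps axis name to (score, evidence). The word-count
    threshold is an assumed heuristic, not part of the rubric.
    """
    rows = list(scorecard.values())
    all_fives = all(score == 5 for score, _ in rows)
    thin = any(len(evidence.split()) < min_evidence_words for _, evidence in rows)
    return all_fives and thin


suspect = {
    "Accuracy": (5, "All good."),
    "Completeness": (5, "Everything covered."),
    "Clarity": (5, "Clear."),
}
print(looks_like_rubber_stamp(suspect))  # True
```
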
### Over-penalizing for scope creep

```
FAIL: Completeness: 2 — Didn't handle WebSocket connections or
      gRPC streaming (user didn't ask for these)
```

Only evaluate against what the user actually requested, not what you could have additionally built.

### Using the evaluation to re-litigate

```
FAIL: "As I said earlier, this approach is wrong. Score: 1"
```

The evaluation is about the delivered output, not about re-arguing design decisions that were already made. If the approach was wrong, that should have been caught before delivery.

### Mixing personal preference with objective gaps

```
FAIL: "Score: 3. I don't like Python decorators."
```

"Don't like" is not evidence. Cite a concrete readability, testability, or correctness concern, or leave the score at 4+.

## Best Practices

- **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took.
- **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap.
- **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling."
- **Be specific about what 'fixed' looks like.** "Re-run with httpx transport configured for retries" beats "fix the library issue."
- **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof.
- **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?"

## Related Skills

- `agent-eval` — Head-to-head comparison of different coding agents on benchmark tasks
- `verification-loop` — Systematic verification of outputs against expected results
- `security-review` — Security-focused code review checklist