Hawthorn c0f651cf85 fix: align report format across evaluate.py, agent spec, and template
- evaluate.py: add CRITICAL ISSUES (axes ≤ 2) section, VERDICT line
- agent-evaluator.md: match format_report output exactly (title, evidence markers, bar graphs)
- templates/evaluation-report.md: match evaluate.py output format
- All now produce identical AGENT SELF-EVALUATION REPORT structure

Single authoritative format: evaluate.py's format_report() output.
2026-06-10 17:11:44 +05:30

84 lines
3.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Agent Self-Evaluation Report Template
Copy this template and fill in after completing a task. The format matches `scripts/evaluate.py` output.
```
============================================================
AGENT SELF-EVALUATION REPORT
============================================================
Accuracy █████ 5/5 or ███░░ 3/5
+ [Evidence: passing tests, verified claims]
- [Gaps: unverified claims, hedging language]
→ [Improvement if score < 5]
Completeness █████ 5/5
+ [What's covered: all requirements + edge cases]
- [What's missing: explicitly acknowledge gaps]
→ [Improvement if score < 5]
Clarity █████ 5/5
+ [Structure: headings, code blocks, bullet points]
- [Issues: undefined terms, wall of text, no summary]
→ [Improvement if score < 5]
Actionability █████ 5/5
+ [User can: merge PR, run command, review file]
- [Blockers: missing steps, vague suggestions]
→ [Improvement if score < 5]
Conciseness █████ 5/5
+ [Tight: no repetition, high information density]
- [Bloat: filler, meta-commentary, repeated points]
→ [Improvement if score < 5]
OVERALL X.X/5
CRITICAL ISSUES (axes ≤ 2):
[Axis] Score N/5 — specific fix needed
(or "None" if no axis ≤ 2)
TOP IMPROVEMENTS:
1. [Highest impact fix]
2. [Second highest]
(Only list axes scoring < 4, ranked by user impact)
VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
```
## Quick Reference: Scoring Triggers
| If you see this... | Accuracy | Completeness | Clarity | Actionability | Conciseness |
|---|---|---|---|---|---|
| "should work" / "probably fine" | ≤4 | — | — | — | — |
| "I think" / "I believe" | ≤4 | — | — | — | — |
| No test output cited | ≤4 | — | — | — | — |
| "TODO" / "FIXME" left behind | ≤3 | ≤3 | — | ≤3 | — |
| Missing error handling | — | ≤3 | — | — | — |
| Only happy path covered | — | ≤3 | — | — | — |
| Wall-of-text paragraph (>200 words) | — | — | ≤3 | — | — |
| No headings or structure | — | — | ≤3 | — | — |
| "You should..." without specifics | — | — | — | ≤3 | — |
| No PR or file created | — | — | — | ≤3 | — |
| User needs to figure out next step | — | — | — | ≤2 | — |
| Repeated points (3+ times) | — | — | — | — | ≤3 |
| "Let me explain..." / "To summarize..." x3+ | — | — | — | — | ≤3 |
| Output >15x longer than task | — | — | — | — | ≤3 |
## When to Skip
Skip the evaluation if:
- Task was a single tool call (e.g., "read this file" — nothing to evaluate)
- User explicitly says "don't evaluate" or "just do it"
- Task is purely conversational (greeting, small talk)
- You're mid-workflow and the user will judge the final output, not intermediate steps
## Post-Evaluation Actions
| Overall Score | What to do |
|---|---|
| ≥4.5 | Deliver as-is. No changes needed. |
| 3.54.4 | Flag top improvement but deliver. Fix if <30 seconds. |
| 2.53.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" |
| <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. |