mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-06-16 16:36:53 +08:00
fix: align report format across evaluate.py, agent spec, and template
- evaluate.py: add CRITICAL ISSUES (axes ≤ 2) section, VERDICT line
- agent-evaluator.md: match format_report output exactly (title, evidence markers, bar graphs)
- templates/evaluation-report.md: match evaluate.py output format
- All now produce identical AGENT SELF-EVALUATION REPORT structure

Single authoritative format: evaluate.py's format_report() output.
parent d0a84db177
commit c0f651cf85
@@ -54,37 +54,49 @@ For each axis:
|
|||||||
|
|
||||||
### Step 4: Produce Report
|
### Step 4: Produce Report
|
||||||
|
|
||||||
Use this format:
|
Use this exact format (matches `scripts/evaluate.py` output):
|
||||||
|
|
||||||
```
|
```
|
||||||
============================================================
|
============================================================
|
||||||
AGENT EVALUATION REPORT
|
AGENT SELF-EVALUATION REPORT
|
||||||
============================================================
|
============================================================
|
||||||
|
|
||||||
Axis Score Evidence
|
Accuracy █████ 5/5
|
||||||
|
+ [Evidence: passing tests, verified claims]
|
||||||
|
→ [Improvement if score < 5]
|
||||||
|
|
||||||
Accuracy X/5 [What was verified, what was wrong]
|
Completeness █████ 5/5
|
||||||
Completeness X/5 [What's covered, what's missing]
|
+ [What's covered]
|
||||||
Clarity X/5 [Structure quality, readability]
|
→ [Improvement if score < 5]
|
||||||
Actionability X/5 [Can user act now? What's the next step?]
|
|
||||||
Conciseness X/5 [Information density, redundancy]
|
|
||||||
|
|
||||||
OVERALL X.X/5
|
Clarity █████ 5/5
|
||||||
|
+ [Structure signals]
|
||||||
|
→ [Improvement if score < 5]
|
||||||
|
|
||||||
|
Actionability █████ 5/5
|
||||||
|
+ [User can act immediately]
|
||||||
|
→ [Improvement if score < 5]
|
||||||
|
|
||||||
|
Conciseness █████ 5/5
|
||||||
|
+ [Information density]
|
||||||
|
→ [Improvement if score < 5]
|
||||||
|
|
||||||
|
OVERALL X.X/5
|
||||||
|
|
||||||
CRITICAL ISSUES (axes ≤ 2):
|
CRITICAL ISSUES (axes ≤ 2):
|
||||||
[If any axis scored 2 or below, list it here with the specific fix needed]
|
[Axis] Score N/5 — specific fix needed
|
||||||
|
(or "None" if no axis ≤ 2)
|
||||||
|
|
||||||
TOP IMPROVEMENTS:
|
TOP IMPROVEMENTS:
|
||||||
1. [Highest impact fix first]
|
1. [Highest impact fix]
|
||||||
2. [Second highest]
|
2. [Second highest]
|
||||||
3. [Third highest]
|
|
||||||
|
|
||||||
VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
|
VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
|
||||||
```
|
```
|
||||||
|
|
||||||
## Output Format
|
## Output Format
|
||||||
|
|
||||||
Always include the structured report above. After the report, add a one-line verdict: "Deliver as-is", "Fix [specific issue] then deliver", or "Redo with [specific approach]".
|
Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT" (not "AGENT EVALUATION REPORT").
|
||||||
|
|
||||||
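The per-axis block in the report format above (bar graph, `+` evidence lines, `-` gap lines, `→` improvement line) can be sketched in a few lines of Python. This is a minimal illustration, not the code in `scripts/evaluate.py`: the `AxisScore` fields `evidence` and `gaps`, the helper names `render_bar` and `render_axis`, and the exact indentation are assumptions made for the example.

```python
from dataclasses import dataclass, field

BAR_WIDTH = 5  # scores are reported out of 5


@dataclass
class AxisScore:
    # Hypothetical stand-in for the type passed to format_report();
    # the evidence and gaps fields are assumed for illustration.
    name: str
    score: int
    evidence: list[str] = field(default_factory=list)
    gaps: list[str] = field(default_factory=list)
    improvement: str = ""


def render_bar(score: int) -> str:
    """Return a fixed-width bar: filled blocks for the score, light blocks for the rest."""
    return "█" * score + "░" * (BAR_WIDTH - score)


def render_axis(axis: AxisScore) -> list[str]:
    """Render one axis as a header line followed by evidence, gap, and improvement lines."""
    lines = [f"  {axis.name:<15} {render_bar(axis.score)} {axis.score}/5"]
    lines += [f"    + {item}" for item in axis.evidence]
    lines += [f"    - {item}" for item in axis.gaps]
    # The improvement line only appears when the axis is below the maximum score.
    if axis.improvement and axis.score < 5:
        lines.append(f"    → {axis.improvement}")
    return lines


completeness = AxisScore(
    name="Completeness",
    score=4,
    evidence=["All HTTP methods covered", "Edge cases documented"],
    improvement="Missing: connection pool exhaustion handling (minor edge case)",
)
print("\n".join(render_axis(completeness)))
```

The sketch does not validate that `score` lies between 0 and 5; a real implementation would likely want to check that before building the bar.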
## Examples
|
## Examples
|
||||||
|
|
||||||
@@ -93,26 +105,44 @@ Always include the structured report above. After the report, add a one-line ver
|
|||||||
Task: Add retry logic to HTTP client. 3 retries, exponential backoff.
|
Task: Add retry logic to HTTP client. 3 retries, exponential backoff.
|
||||||
|
|
||||||
```
|
```
|
||||||
AGENT EVALUATION REPORT
|
============================================================
|
||||||
|
AGENT SELF-EVALUATION REPORT
|
||||||
|
============================================================
|
||||||
|
|
||||||
Accuracy 5/5 grep confirms httpx.Retry used correctly.
|
Accuracy █████ 5/5
|
||||||
Tests pass (42/42). Import verified.
|
+ Tests passing
|
||||||
Completeness 4/5 All HTTP methods covered. Missing: connection
|
+ grep confirms httpx.Retry used correctly
|
||||||
pool exhaustion handling (minor edge case).
|
+ Import verified
|
||||||
Clarity 5/5 Well-structured. Summary, code blocks, bullet
|
|
||||||
points. 10-second scan tells the full story.
|
|
||||||
Actionability 5/5 Single PR (#423). `pytest -v` cited. Merge is
|
|
||||||
the only action needed.
|
|
||||||
Conciseness 4/5 250 words. Verification section slightly
|
|
||||||
verbose — 3 commands could be 1 script.
|
|
||||||
|
|
||||||
OVERALL 4.6/5
|
Completeness ████░ 4/5
|
||||||
|
+ All HTTP methods covered
|
||||||
|
+ Edge cases documented
|
||||||
|
→ Missing: connection pool exhaustion handling (minor edge case)
|
||||||
|
|
||||||
|
Clarity █████ 5/5
|
||||||
|
+ Uses headings for structure
|
||||||
|
+ Summary in first 3 lines
|
||||||
|
+ Code blocks with language tags
|
||||||
|
|
||||||
|
Actionability █████ 5/5
|
||||||
|
+ PR #423 created
|
||||||
|
+ pytest -v cited (42 passed)
|
||||||
|
+ Single action: merge PR
|
||||||
|
|
||||||
|
Conciseness ████░ 4/5
|
||||||
|
+ 250 words, high density
|
||||||
|
→ Verification section slightly verbose — 3 commands could be 1 script
|
||||||
|
|
||||||
|
OVERALL 4.6/5
|
||||||
|
|
||||||
|
CRITICAL ISSUES (axes ≤ 2):
|
||||||
|
None
|
||||||
|
|
||||||
TOP IMPROVEMENTS:
|
TOP IMPROVEMENTS:
|
||||||
1. Add connection pool exhaustion to edge cases doc
|
1. [Completeness] Add connection pool exhaustion to edge cases doc
|
||||||
2. Consolidate verification commands into a single script
|
2. [Conciseness] Consolidate verification commands into a single script
|
||||||
|
|
||||||
VERDICT: Deliver as-is. The one gap (pool exhaustion) is a P2 edge case.
|
VERDICT: Deliver as-is. Minor improvements noted above.
|
||||||
```
|
```
|
||||||
|
|
||||||
### Example: Weak Output
|
### Example: Weak Output
|
||||||
@@ -120,33 +150,48 @@ VERDICT: Deliver as-is. The one gap (pool exhaustion) is a P2 edge case.
|
|||||||
Task: Same as above.
|
Task: Same as above.
|
||||||
|
|
||||||
```
|
```
|
||||||
AGENT EVALUATION REPORT
|
============================================================
|
||||||
|
AGENT SELF-EVALUATION REPORT
|
||||||
|
============================================================
|
||||||
|
|
||||||
Accuracy 2/5 CRITICAL: Agent used urllib3.Retry but project
|
Accuracy ██░░░ 2/5
|
||||||
uses httpx. grep proves no urllib3 import exists.
|
+ Code block present
|
||||||
Hedging language: "I think", "probably fine".
|
- Hedged claim without verification ("I think this should work")
|
||||||
Completeness 3/5 Only handles 5xx. Missing: 429 rate limiting,
|
- Explicitly untested
|
||||||
connection timeouts. Agent acknowledges gaps
|
- Speculation without evidence
|
||||||
("might be edge cases") but doesn't fix them.
|
→ Cite specific tool outputs (test results, exit codes, grep findings)
|
||||||
Clarity 3/5 Code is readable but no explanation of where
|
|
||||||
to integrate. "Add this somewhere" is vague.
|
|
||||||
Actionability 2/5 No PR, no file created, no test written.
|
|
||||||
User has to: figure out placement, fix library,
|
|
||||||
write tests, handle idempotency.
|
|
||||||
Conciseness 3/5 120 words but ~50% is hedging/disclaimers.
|
|
||||||
Low information density.
|
|
||||||
|
|
||||||
OVERALL 2.6/5
|
Completeness ███░░ 3/5
|
||||||
|
+ Provides code example
|
||||||
|
- Explicit gap acknowledged ("might be edge cases with POST")
|
||||||
|
- Limited scope noted (only 5xx, missing 429 and connection errors)
|
||||||
|
→ List what's covered AND what's intentionally excluded
|
||||||
|
|
||||||
CRITICAL ISSUES:
|
Clarity ████░ 4/5
|
||||||
Accuracy: Wrong library. Use httpx.Retry, not urllib3.Retry.
|
+ Uses code blocks
|
||||||
Actionability: No deliverable. Create a PR with the changed file + tests.
|
- No integration guidance ("add this somewhere" is vague)
|
||||||
|
→ Specify exact file and line where code should be added
|
||||||
|
|
||||||
|
Actionability ██░░░ 2/5
|
||||||
|
- Defers work to user ("you'll want to test this")
|
||||||
|
- Vague suggestion without specifics
|
||||||
|
→ Create a PR with the changed file + tests
|
||||||
|
|
||||||
|
Conciseness ███░░ 3/5
|
||||||
|
+ Short (120 words)
|
||||||
|
- Low information density (~50% hedging/disclaimers)
|
||||||
|
→ Cut meta-commentary and filler
|
||||||
|
|
||||||
|
OVERALL 2.8/5
|
||||||
|
|
||||||
|
CRITICAL ISSUES (axes ≤ 2):
|
||||||
|
[Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry.
|
||||||
|
[Actionability] Score 2/5 — No deliverable. Create a PR with test file.
|
||||||
|
|
||||||
TOP IMPROVEMENTS:
|
TOP IMPROVEMENTS:
|
||||||
1. Switch to httpx.Retry — grep the codebase first to confirm the HTTP library
|
1. [Accuracy] Switch to httpx.Retry — grep the codebase first
|
||||||
2. Create a PR with src/api_client.py + tests/test_api_client.py
|
2. [Actionability] Create a PR with src/api_client.py + tests
|
||||||
3. Handle 429, connection errors, and timeout — not just 5xx
|
3. [Completeness] Handle 429, connection errors, and timeout
|
||||||
|
|
||||||
VERDICT: Redo with httpx.Retry, full HTTP method coverage, and a test file.
|
VERDICT: Redo with specific fixes. Weakest axis: Accuracy (2/5).
|
||||||
Do not deliver until accuracy ≥ 4.
|
```
|
||||||
```
|
|
||||||
@@ -306,14 +306,40 @@ def format_report(scores: list[AxisScore]) -> str:
|
|||||||
lines.append(f" {'OVERALL':<15} {avg:.1f}/5")
|
lines.append(f" {'OVERALL':<15} {avg:.1f}/5")
|
||||||
lines.append("")
|
lines.append("")
|
||||||
|
|
||||||
# Top improvements
|
# Critical issues (axes ≤ 2)
|
||||||
improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4]
|
critical = [(s, s.improvement or "No improvement suggested") for s in scores if s.score <= 2]
|
||||||
if improvements:
|
lines.append("CRITICAL ISSUES (axes ≤ 2):")
|
||||||
lines.append("TOP IMPROVEMENTS (axes scoring < 4):")
|
if critical:
|
||||||
for s, imp in sorted(improvements, key=lambda x: x[0].score):
|
for s, imp in critical:
|
||||||
lines.append(f" [{s.name}] {imp}")
|
lines.append(f" [{s.name}] Score {s.score}/5 — {imp}")
|
||||||
else:
|
else:
|
||||||
lines.append("No axes below 4. Strong output across all dimensions.")
|
lines.append(" None")
|
||||||
|
|
||||||
|
lines.append("")
|
||||||
|
|
||||||
|
# Top improvements (axes scoring < 4, ranked by impact)
|
||||||
|
improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4]
|
||||||
|
lines.append("TOP IMPROVEMENTS:")
|
||||||
|
if improvements:
|
||||||
|
for i, (s, imp) in enumerate(sorted(improvements, key=lambda x: x[0].score), 1):
|
||||||
|
lines.append(f" {i}. [{s.name}] {imp}")
|
||||||
|
else:
|
||||||
|
lines.append(" No axes below 4. Strong output across all dimensions.")
|
||||||
|
|
||||||
|
lines.append("")
|
||||||
|
|
||||||
|
# Verdict
|
||||||
|
min_score = min(s.score for s in scores)
|
||||||
|
if min_score <= 2:
|
||||||
|
verdict = f"Redo with specific fixes. Weakest axis: {min(scores, key=lambda s: s.score).name} ({min_score}/5)."
|
||||||
|
elif any(s.score <= 3 for s in scores):
|
||||||
|
weak = [s.name for s in scores if s.score <= 3]
|
||||||
|
verdict = f"Fix {'/'.join(weak)} issues, then deliver."
|
||||||
|
elif avg >= 4.5:
|
||||||
|
verdict = "Deliver as-is. No changes needed."
|
||||||
|
else:
|
||||||
|
verdict = "Deliver as-is. Minor improvements noted above."
|
||||||
|
lines.append(f"VERDICT: {verdict}")
|
||||||
|
|
||||||
return "\n".join(lines)
|
return "\n".join(lines)
|
||||||
|
|
||||||
|
|||||||
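The verdict branch added to `format_report()` above can be exercised on its own. The following is a minimal, self-contained sketch of that branch order. The trimmed `AxisScore` dataclass and the function name `pick_verdict` are assumptions for illustration; the thresholds and verdict strings are taken from the diff.

```python
from dataclasses import dataclass


@dataclass
class AxisScore:
    # Hypothetical trimmed-down version of the real AxisScore type.
    name: str
    score: int


def pick_verdict(scores: list[AxisScore]) -> str:
    """Choose a verdict using the same branch order as the diff above."""
    avg = sum(s.score for s in scores) / len(scores)
    weakest = min(scores, key=lambda s: s.score)  # first axis wins on ties

    # Any axis at 2 or below forces a redo, regardless of the average.
    if weakest.score <= 2:
        return f"Redo with specific fixes. Weakest axis: {weakest.name} ({weakest.score}/5)."

    # Axes at 3 are fixable: name them and deliver afterwards.
    weak = [s.name for s in scores if s.score <= 3]
    if weak:
        return f"Fix {'/'.join(weak)} issues, then deliver."

    # All axes are 4 or 5 here; the average decides the wording.
    if avg >= 4.5:
        return "Deliver as-is. No changes needed."
    return "Deliver as-is. Minor improvements noted above."


names = ["Accuracy", "Completeness", "Clarity", "Actionability", "Conciseness"]
weak_output = [AxisScore(n, v) for n, v in zip(names, [2, 3, 4, 2, 3])]
strong_output = [AxisScore(n, v) for n, v in zip(names, [5, 4, 5, 5, 4])]

print(pick_verdict(weak_output))
print(pick_verdict(strong_output))
```

The minimum score is checked before the average, so a single axis at 2/5 triggers a redo even when the other axes are strong. With scores of 5, 4, 5, 5, 4 the mean is 4.6, which lands in the "No changes needed" branch of this logic.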
@@ -1,6 +1,6 @@
|
|||||||
# Agent Self-Evaluation Report Template
|
# Agent Self-Evaluation Report Template
|
||||||
|
|
||||||
Copy this template and fill in after completing a task. Keep it in your conversation — the user sees it inline.
|
Copy this template and fill in after completing a task. The format matches `scripts/evaluate.py` output.
|
||||||
|
|
||||||
```
|
```
|
||||||
============================================================
|
============================================================
|
||||||
@@ -10,27 +10,40 @@ AGENT SELF-EVALUATION REPORT
|
|||||||
Accuracy █████ 5/5 or ███░░ 3/5
|
Accuracy █████ 5/5 or ███░░ 3/5
|
||||||
+ [Evidence: passing tests, verified claims]
|
+ [Evidence: passing tests, verified claims]
|
||||||
- [Gaps: unverified claims, hedging language]
|
- [Gaps: unverified claims, hedging language]
|
||||||
|
→ [Improvement if score < 5]
|
||||||
|
|
||||||
Completeness █████ 5/5
|
Completeness █████ 5/5
|
||||||
+ [What's covered: all requirements + edge cases]
|
+ [What's covered: all requirements + edge cases]
|
||||||
- [What's missing: explicitly acknowledge gaps]
|
- [What's missing: explicitly acknowledge gaps]
|
||||||
|
→ [Improvement if score < 5]
|
||||||
|
|
||||||
Clarity █████ 5/5
|
Clarity █████ 5/5
|
||||||
+ [Structure: headings, code blocks, bullet points]
|
+ [Structure: headings, code blocks, bullet points]
|
||||||
- [Issues: undefined terms, wall of text, no summary]
|
- [Issues: undefined terms, wall of text, no summary]
|
||||||
|
→ [Improvement if score < 5]
|
||||||
|
|
||||||
Actionability █████ 5/5
|
Actionability █████ 5/5
|
||||||
+ [User can: merge PR, run command, review file]
|
+ [User can: merge PR, run command, review file]
|
||||||
- [Blockers: missing steps, vague suggestions]
|
- [Blockers: missing steps, vague suggestions]
|
||||||
|
→ [Improvement if score < 5]
|
||||||
|
|
||||||
Conciseness █████ 5/5
|
Conciseness █████ 5/5
|
||||||
+ [Tight: no repetition, high information density]
|
+ [Tight: no repetition, high information density]
|
||||||
- [Bloat: filler, meta-commentary, repeated points]
|
- [Bloat: filler, meta-commentary, repeated points]
|
||||||
|
→ [Improvement if score < 5]
|
||||||
|
|
||||||
OVERALL X.X/5
|
OVERALL X.X/5
|
||||||
|
|
||||||
|
CRITICAL ISSUES (axes ≤ 2):
|
||||||
|
[Axis] Score N/5 — specific fix needed
|
||||||
|
(or "None" if no axis ≤ 2)
|
||||||
|
|
||||||
TOP IMPROVEMENTS:
|
TOP IMPROVEMENTS:
|
||||||
[Only list axes scoring < 4, ranked by user impact]
|
1. [Highest impact fix]
|
||||||
|
2. [Second highest]
|
||||||
|
(Only list axes scoring < 4, ranked by user impact)
|
||||||
|
|
||||||
|
VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
|
||||||
```
|
```
|
||||||
|
|
||||||
## Quick Reference: Scoring Triggers
|
## Quick Reference: Scoring Triggers
|
||||||
@@ -64,7 +77,7 @@ Skip the evaluation if:
|
|||||||
|
|
||||||
| Overall Score | What to do |
|
| Overall Score | What to do |
|
||||||
|---|---|
|
|---|---|
|
||||||
| ≥4.5 | Deliver. No changes needed. |
|
| ≥4.5 | Deliver as-is. No changes needed. |
|
||||||
| 3.5–4.4 | Flag the top improvement but deliver. Fix if <30 seconds. |
|
| 3.5–4.4 | Flag top improvement but deliver. Fix if <30 seconds. |
|
||||||
| 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" |
|
| 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" |
|
||||||
| <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. |
|
| <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. |
|
||||||
|
|||||||
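The overall-score table above maps a one-decimal score to an action. A minimal sketch of that lookup follows. The function name `action_for` and the short action labels are hypothetical, and rounding to one decimal before comparing is an assumption made so that the bands (≥4.5, 3.5–4.4, 2.5–3.4, <2.5) leave no gaps.

```python
def action_for(overall: float) -> str:
    """Map an overall score to the action band from the table (labels are illustrative)."""
    score = round(overall, 1)  # the report prints one decimal, so compare at that precision
    if score >= 4.5:
        return "deliver"           # Deliver as-is. No changes needed.
    if score >= 3.5:
        return "flag-and-deliver"  # Flag top improvement but deliver.
    if score >= 2.5:
        return "ask-user"          # State what you'd change and ask the user.
    return "redo"                  # Don't deliver; redo with a specific fix.


for overall in (4.6, 3.8, 2.8, 2.0):
    print(f"{overall:.1f} -> {action_for(overall)}")
```

This table keys on the overall mean only, whereas the VERDICT line in `format_report()` also looks at the lowest single axis, so the two can point to different actions for the same report.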