fix: align report format across evaluate.py, agent spec, and template

- evaluate.py: add CRITICAL ISSUES (axes ≤ 2) section, VERDICT line
- agent-evaluator.md: match format_report output exactly (title, evidence markers, bar graphs)
- templates/evaluation-report.md: match evaluate.py output format
- All now produce identical AGENT SELF-EVALUATION REPORT structure

Single authoritative format: evaluate.py's format_report() output.
2026-06-16 08:26:52 +08:00 · 2026-06-10 17:11:36 +05:30 · 2026-06-10 17:11:36 +05:30 · c0f651cf85
commit c0f651cf85
parent d0a84db177
3 changed files with 147 additions and 63 deletions
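
The verdict branch that this commit adds to `format_report()` can be read in isolation. The following is a minimal sketch, not the script itself: `AxisScore` here is a hypothetical stand-in that carries only the fields the diff reads (`name`, `score`, `improvement`), and `verdict_for` is an illustrative name. The branch order and the verdict strings are copied from the `evaluate.py` hunk below.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AxisScore:
    # Hypothetical stand-in: only the fields the diff reads.
    name: str
    score: int
    improvement: Optional[str] = None


def verdict_for(scores: list[AxisScore]) -> str:
    """Mirror the verdict branch order from the format_report() hunk."""
    avg = sum(s.score for s in scores) / len(scores)
    weakest = min(scores, key=lambda s: s.score)  # first axis wins on ties
    if weakest.score <= 2:
        return f"Redo with specific fixes. Weakest axis: {weakest.name} ({weakest.score}/5)."
    weak = [s.name for s in scores if s.score <= 3]
    if weak:
        return f"Fix {'/'.join(weak)} issues, then deliver."
    if avg >= 4.5:
        return "Deliver as-is. No changes needed."
    return "Deliver as-is. Minor improvements noted above."


AXES = ["Accuracy", "Completeness", "Clarity", "Actionability", "Conciseness"]

# Scores taken from the two worked examples in agents/agent-evaluator.md.
strong = [AxisScore(n, s) for n, s in zip(AXES, [5, 4, 5, 5, 4])]
weak = [AxisScore(n, s) for n, s in zip(AXES, [2, 3, 4, 2, 3])]

print(verdict_for(strong))  # Deliver as-is. No changes needed.
print(verdict_for(weak))    # Redo with specific fixes. Weakest axis: Accuracy (2/5).
```

Under this sketch the strong example (average 4.6) lands in the "No changes needed" branch, while the example report in `agent-evaluator.md` shows the "Minor improvements noted above" wording. That string is only reached when every axis scores at least 4 and the average is below 4.5.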
--- a/agents/agent-evaluator.md
+++ b/agents/agent-evaluator.md
@@ -54,37 +54,49 @@ For each axis:

 ### Step 4: Produce Report

-Use this format:
+Use this exact format (matches `scripts/evaluate.py` output):

 ```
 ============================================================
-AGENT EVALUATION REPORT
+AGENT SELF-EVALUATION REPORT
 ============================================================

-  Axis            Score   Evidence
+  Accuracy         █████ 5/5
+    + [Evidence: passing tests, verified claims]
+    → [Improvement if score < 5]

-  Accuracy         X/5    [What was verified, what was wrong]
-  Completeness     X/5    [What's covered, what's missing]
-  Clarity          X/5    [Structure quality, readability]
-  Actionability    X/5    [Can user act now? What's the next step?]
-  Conciseness      X/5    [Information density, redundancy]
+  Completeness      █████ 5/5
+    + [What's covered]
+    → [Improvement if score < 5]

-  OVERALL          X.X/5
+  Clarity           █████ 5/5
+    + [Structure signals]
+    → [Improvement if score < 5]
+
+  Actionability     █████ 5/5
+    + [User can act immediately]
+    → [Improvement if score < 5]
+
+  Conciseness       █████ 5/5
+    + [Information density]
+    → [Improvement if score < 5]
+
+  OVERALL           X.X/5

 CRITICAL ISSUES (axes ≤ 2):
-  [If any axis scored 2 or below, list it here with the specific fix needed]
+  [Axis] Score N/5 — specific fix needed
+  (or "None" if no axis ≤ 2)

 TOP IMPROVEMENTS:
-  1. [Highest impact fix first]
+  1. [Highest impact fix]
  2. [Second highest]
-  3. [Third highest]

 VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
 ```

 ## Output Format

-Always include the structured report above. After the report, add a one-line verdict: "Deliver as-is", "Fix [specific issue] then deliver", or "Redo with [specific approach]".
+Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT" (not "AGENT EVALUATION REPORT").

 ## Examples

@@ -93,26 +105,44 @@ Always include the structured report above. After the report, add a one-line ver
 Task: Add retry logic to HTTP client. 3 retries, exponential backoff.

 ```
-AGENT EVALUATION REPORT
+============================================================
+AGENT SELF-EVALUATION REPORT
+============================================================

-  Accuracy         5/5    grep confirms httpx.Retry used correctly.
-                          Tests pass (42/42). Import verified.
-  Completeness      4/5    All HTTP methods covered. Missing: connection
-                          pool exhaustion handling (minor edge case).
-  Clarity           5/5    Well-structured. Summary, code blocks, bullet
-                          points. 10-second scan tells the full story.
-  Actionability     5/5    Single PR (#423). `pytest -v` cited. Merge is
-                          the only action needed.
-  Conciseness       4/5    250 words. Verification section slightly
-                          verbose — 3 commands could be 1 script.
+  Accuracy         █████ 5/5
+    + Tests passing
+    + grep confirms httpx.Retry used correctly
+    + Import verified

-  OVERALL          4.6/5
+  Completeness      ████░ 4/5
+    + All HTTP methods covered
+    + Edge cases documented
+    → Missing: connection pool exhaustion handling (minor edge case)
+
+  Clarity           █████ 5/5
+    + Uses headings for structure
+    + Summary in first 3 lines
+    + Code blocks with language tags
+
+  Actionability     █████ 5/5
+    + PR #423 created
+    + pytest -v cited (42 passed)
+    + Single action: merge PR
+
+  Conciseness       ████░ 4/5
+    + 250 words, high density
+    → Verification section slightly verbose — 3 commands could be 1 script
+
+  OVERALL           4.6/5
+
+CRITICAL ISSUES (axes ≤ 2):
+  None

 TOP IMPROVEMENTS:
-  1. Add connection pool exhaustion to edge cases doc
-  2. Consolidate verification commands into a single script
+  1. [Completeness] Add connection pool exhaustion to edge cases doc
+  2. [Conciseness] Consolidate verification commands into a single script

-VERDICT: Deliver as-is. The one gap (pool exhaustion) is a P2 edge case.
+VERDICT: Deliver as-is. Minor improvements noted above.
 ```

 ### Example: Weak Output
@@ -120,33 +150,48 @@ VERDICT: Deliver as-is. The one gap (pool exhaustion) is a P2 edge case.
 Task: Same as above.

 ```
-AGENT EVALUATION REPORT
+============================================================
+AGENT SELF-EVALUATION REPORT
+============================================================

-  Accuracy         2/5    CRITICAL: Agent used urllib3.Retry but project
-                          uses httpx. grep proves no urllib3 import exists.
-                          Hedging language: "I think", "probably fine".
-  Completeness      3/5    Only handles 5xx. Missing: 429 rate limiting,
-                          connection timeouts. Agent acknowledges gaps
-                          ("might be edge cases") but doesn't fix them.
-  Clarity           3/5    Code is readable but no explanation of where
-                          to integrate. "Add this somewhere" is vague.
-  Actionability     2/5    No PR, no file created, no test written.
-                          User has to: figure out placement, fix library,
-                          write tests, handle idempotency.
-  Conciseness       3/5    120 words but ~50% is hedging/disclaimers.
-                          Low information density.
+  Accuracy         ██░░░ 2/5
+    + Code block present
+    - Hedged claim without verification ("I think this should work")
+    - Explicitly untested
+    - Speculation without evidence
+    → Cite specific tool outputs (test results, exit codes, grep findings)

-  OVERALL          2.6/5
+  Completeness      ███░░ 3/5
+    + Provides code example
+    - Explicit gap acknowledged ("might be edge cases with POST")
+    - Limited scope noted (only 5xx, missing 429 and connection errors)
+    → List what's covered AND what's intentionally excluded

-CRITICAL ISSUES:
-  Accuracy: Wrong library. Use httpx.Retry, not urllib3.Retry.
-  Actionability: No deliverable. Create a PR with the changed file + tests.
+  Clarity           ████░ 4/5
+    + Uses code blocks
+    - No integration guidance ("add this somewhere" is vague)
+    → Specify exact file and line where code should be added
+
+  Actionability     ██░░░ 2/5
+    - Defers work to user ("you'll want to test this")
+    - Vague suggestion without specifics
+    → Create a PR with the changed file + tests
+
+  Conciseness       ███░░ 3/5
+    + Short (120 words)
+    - Low information density (~50% hedging/disclaimers)
+    → Cut meta-commentary and filler
+
+  OVERALL           2.8/5
+
+CRITICAL ISSUES (axes ≤ 2):
+  [Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry.
+  [Actionability] Score 2/5 — No deliverable. Create a PR with test file.

 TOP IMPROVEMENTS:
-  1. Switch to httpx.Retry — grep the codebase first to confirm the HTTP library
-  2. Create a PR with src/api_client.py + tests/test_api_client.py
-  3. Handle 429, connection errors, and timeout — not just 5xx
+  1. [Accuracy] Switch to httpx.Retry — grep the codebase first
+  2. [Actionability] Create a PR with src/api_client.py + tests
+  3. [Completeness] Handle 429, connection errors, and timeout

-VERDICT: Redo with httpx.Retry, full HTTP method coverage, and a test file.
-  Do not deliver until accuracy ≥ 4.
-```
+VERDICT: Redo with specific fixes. Weakest axis: Accuracy (2/5).
+```
--- a/skills/agent-self-evaluation/scripts/evaluate.py
+++ b/skills/agent-self-evaluation/scripts/evaluate.py
@@ -306,14 +306,40 @@ def format_report(scores: list[AxisScore]) -> str:
    lines.append(f"  {'OVERALL':<15} {avg:.1f}/5")
    lines.append("")

-    # Top improvements
-    improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4]
-    if improvements:
-        lines.append("TOP IMPROVEMENTS (axes scoring < 4):")
-        for s, imp in sorted(improvements, key=lambda x: x[0].score):
-            lines.append(f"  [{s.name}] {imp}")
+    # Critical issues (axes ≤ 2)
+    critical = [(s, s.improvement or "No improvement suggested") for s in scores if s.score <= 2]
+    lines.append("CRITICAL ISSUES (axes ≤ 2):")
+    if critical:
+        for s, imp in critical:
+            lines.append(f"  [{s.name}] Score {s.score}/5 — {imp}")
    else:
-        lines.append("No axes below 4. Strong output across all dimensions.")
+        lines.append("  None")
+
+    lines.append("")
+
+    # Top improvements (axes scoring < 4, ranked by impact)
+    improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4]
+    lines.append("TOP IMPROVEMENTS:")
+    if improvements:
+        for i, (s, imp) in enumerate(sorted(improvements, key=lambda x: x[0].score), 1):
+            lines.append(f"  {i}. [{s.name}] {imp}")
+    else:
+        lines.append("  No axes below 4. Strong output across all dimensions.")
+
+    lines.append("")
+
+    # Verdict
+    min_score = min(s.score for s in scores)
+    if min_score <= 2:
+        verdict = f"Redo with specific fixes. Weakest axis: {min(scores, key=lambda s: s.score).name} ({min_score}/5)."
+    elif any(s.score <= 3 for s in scores):
+        weak = [s.name for s in scores if s.score <= 3]
+        verdict = f"Fix {'/'.join(weak)} issues, then deliver."
+    elif avg >= 4.5:
+        verdict = "Deliver as-is. No changes needed."
+    else:
+        verdict = "Deliver as-is. Minor improvements noted above."
+    lines.append(f"VERDICT: {verdict}")

    return "\n".join(lines)

--- a/skills/agent-self-evaluation/templates/evaluation-report.md
+++ b/skills/agent-self-evaluation/templates/evaluation-report.md
@@ -1,6 +1,6 @@
 # Agent Self-Evaluation Report Template

-Copy this template and fill in after completing a task. Keep it in your conversation — the user sees it inline.
+Copy this template and fill in after completing a task. The format matches `scripts/evaluate.py` output.

 ```
 ============================================================
@@ -10,27 +10,40 @@ AGENT SELF-EVALUATION REPORT
  Accuracy         █████ 5/5    or    ███░░ 3/5
    + [Evidence: passing tests, verified claims]
    - [Gaps: unverified claims, hedging language]
+    → [Improvement if score < 5]

  Completeness      █████ 5/5
    + [What's covered: all requirements + edge cases]
    - [What's missing: explicitly acknowledge gaps]
+    → [Improvement if score < 5]

  Clarity           █████ 5/5
    + [Structure: headings, code blocks, bullet points]
    - [Issues: undefined terms, wall of text, no summary]
+    → [Improvement if score < 5]

  Actionability     █████ 5/5
    + [User can: merge PR, run command, review file]
    - [Blockers: missing steps, vague suggestions]
+    → [Improvement if score < 5]

  Conciseness       █████ 5/5
    + [Tight: no repetition, high information density]
    - [Bloat: filler, meta-commentary, repeated points]
+    → [Improvement if score < 5]

  OVERALL           X.X/5

+CRITICAL ISSUES (axes ≤ 2):
+  [Axis] Score N/5 — specific fix needed
+  (or "None" if no axis ≤ 2)
+
 TOP IMPROVEMENTS:
-  [Only list axes scoring < 4, ranked by user impact]
+  1. [Highest impact fix]
+  2. [Second highest]
+  (Only list axes scoring < 4, ranked by user impact)
+
+VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
 ```

 ## Quick Reference: Scoring Triggers
@@ -64,7 +77,7 @@ Skip the evaluation if:

 | Overall Score | What to do |
 |---|---|
-| ≥4.5 | Deliver. No changes needed. |
-| 3.5–4.4 | Flag the top improvement but deliver. Fix if <30 seconds. |
+| ≥4.5 | Deliver as-is. No changes needed. |
+| 3.5–4.4 | Flag top improvement but deliver. Fix if <30 seconds. |
 | 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" |
 | <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. |
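
The overall-score table in the template can be expressed as a small lookup. This is a sketch under assumptions: `action_for` is a hypothetical helper that does not appear in the commit, the returned strings are shortened paraphrases of the table rows, and rounding to one decimal is an assumption made so that the bands (for example 4.4 and 4.5) meet without gaps, matching the `X.X/5` display format.

```python
def action_for(overall: float) -> str:
    """Map an OVERALL score to the template's 'What to do' band.

    Hypothetical helper; the thresholds come from the template table.
    """
    # Assumption: compare on the one-decimal value shown in the report.
    overall = round(overall, 1)
    if overall >= 4.5:
        return "Deliver as-is. No changes needed."
    if overall >= 3.5:
        return "Flag top improvement but deliver."
    if overall >= 2.5:
        return "State what you'd change and ask the user."
    return "Don't deliver. Redo with a specific fix."


print(action_for(4.6))  # Deliver as-is. No changes needed.
print(action_for(2.8))  # State what you'd change and ask the user.
```

Note that this table keys on the overall average, whereas the VERDICT line in `evaluate.py` checks the lowest single axis first, so the two can point to different actions for the same report.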