fix: align report format across evaluate.py, agent spec, and template

- evaluate.py: add CRITICAL ISSUES (axes ≤ 2) section, VERDICT line
- agent-evaluator.md: match format_report output exactly (title, evidence markers, bar graphs)
- templates/evaluation-report.md: match evaluate.py output format
- All now produce identical AGENT SELF-EVALUATION REPORT structure

Single authoritative format: evaluate.py's format_report() output.
2026-06-16 16:36:53 +08:00 · 2026-06-10 17:11:36 +05:30 · 2026-06-10 17:11:36 +05:30 · c0f651cf85
commit c0f651cf85
parent d0a84db177
3 changed files with 147 additions and 63 deletions
--- a/agents/agent-evaluator.md
+++ b/agents/agent-evaluator.md
@@ -54,37 +54,49 @@ For each axis:
 ### Step 4: Produce Report
-Use this format:
+Use this exact format (matches `scripts/evaluate.py` output):
 ```
 ============================================================
-AGENT EVALUATION REPORT
+AGENT SELF-EVALUATION REPORT
 ============================================================
-  Axis            Score   Evidence
+  Accuracy         █████ 5/5
    + [Evidence: passing tests, verified claims]
    → [Improvement if score < 5]
-  Accuracy         X/5    [What was verified, what was wrong]
+  Completeness      █████ 5/5
-  Completeness     X/5    [What's covered, what's missing]
+    + [What's covered]
-  Clarity          X/5    [Structure quality, readability]
+    → [Improvement if score < 5]
  Actionability    X/5    [Can user act now? What's the next step?]
  Conciseness      X/5    [Information density, redundancy]
-  OVERALL          X.X/5
+  Clarity           █████ 5/5
    + [Structure signals]
    → [Improvement if score < 5]
  Actionability     █████ 5/5
    + [User can act immediately]
    → [Improvement if score < 5]
  Conciseness       █████ 5/5
    + [Information density]
    → [Improvement if score < 5]
  OVERALL           X.X/5
 CRITICAL ISSUES (axes ≤ 2):
-  [If any axis scored 2 or below, list it here with the specific fix needed]
+  [Axis] Score N/5 — specific fix needed
  (or "None" if no axis ≤ 2)
 TOP IMPROVEMENTS:
-  1. [Highest impact fix first]
+  1. [Highest impact fix]
  2. [Second highest]
  3. [Third highest]
 VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
 ```
 ## Output Format
-Always include the structured report above. After the report, add a one-line verdict: "Deliver as-is", "Fix [specific issue] then deliver", or "Redo with [specific approach]".
+Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT" (not "AGENT EVALUATION REPORT").
 ## Examples
@@ -93,26 +105,44 @@ Always include the structured report above. After the report, add a one-line ver
 Task: Add retry logic to HTTP client. 3 retries, exponential backoff.
 ```
-AGENT EVALUATION REPORT
+============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
-  Accuracy         5/5    grep confirms httpx.Retry used correctly.
+  Accuracy         █████ 5/5
-                          Tests pass (42/42). Import verified.
+    + Tests passing
-  Completeness      4/5    All HTTP methods covered. Missing: connection
+    + grep confirms httpx.Retry used correctly
-                          pool exhaustion handling (minor edge case).
+    + Import verified
  Clarity           5/5    Well-structured. Summary, code blocks, bullet
                          points. 10-second scan tells the full story.
  Actionability     5/5    Single PR (#423). `pytest -v` cited. Merge is
                          the only action needed.
  Conciseness       4/5    250 words. Verification section slightly
                          verbose — 3 commands could be 1 script.
-  OVERALL          4.6/5
+  Completeness      ████░ 4/5
    + All HTTP methods covered
    + Edge cases documented
    → Missing: connection pool exhaustion handling (minor edge case)
  Clarity           █████ 5/5
    + Uses headings for structure
    + Summary in first 3 lines
    + Code blocks with language tags
  Actionability     █████ 5/5
    + PR #423 created
    + pytest -v cited (42 passed)
    + Single action: merge PR
  Conciseness       ████░ 4/5
    + 250 words, high density
    → Verification section slightly verbose — 3 commands could be 1 script
  OVERALL           4.6/5
 CRITICAL ISSUES (axes ≤ 2):
  None
 TOP IMPROVEMENTS:
-  1. Add connection pool exhaustion to edge cases doc
+  1. [Completeness] Add connection pool exhaustion to edge cases doc
-  2. Consolidate verification commands into a single script
+  2. [Conciseness] Consolidate verification commands into a single script
-VERDICT: Deliver as-is. The one gap (pool exhaustion) is a P2 edge case.
+VERDICT: Deliver as-is. Minor improvements noted above.
 ```
 ### Example: Weak Output
@@ -120,33 +150,48 @@ VERDICT: Deliver as-is. The one gap (pool exhaustion) is a P2 edge case.
 Task: Same as above.
 ```
-AGENT EVALUATION REPORT
+============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
-  Accuracy         2/5    CRITICAL: Agent used urllib3.Retry but project
+  Accuracy         ██░░░ 2/5
-                          uses httpx. grep proves no urllib3 import exists.
+    + Code block present
-                          Hedging language: "I think", "probably fine".
+    - Hedged claim without verification ("I think this should work")
-  Completeness      3/5    Only handles 5xx. Missing: 429 rate limiting,
+    - Explicitly untested
-                          connection timeouts. Agent acknowledges gaps
+    - Speculation without evidence
-                          ("might be edge cases") but doesn't fix them.
+    → Cite specific tool outputs (test results, exit codes, grep findings)
  Clarity           3/5    Code is readable but no explanation of where
                          to integrate. "Add this somewhere" is vague.
  Actionability     2/5    No PR, no file created, no test written.
                          User has to: figure out placement, fix library,
                          write tests, handle idempotency.
  Conciseness       3/5    120 words but ~50% is hedging/disclaimers.
                          Low information density.
-  OVERALL          2.6/5
+  Completeness      ███░░ 3/5
    + Provides code example
    - Explicit gap acknowledged ("might be edge cases with POST")
    - Limited scope noted (only 5xx, missing 429 and connection errors)
    → List what's covered AND what's intentionally excluded
-CRITICAL ISSUES:
+  Clarity           ████░ 4/5
-  Accuracy: Wrong library. Use httpx.Retry, not urllib3.Retry.
+    + Uses code blocks
-  Actionability: No deliverable. Create a PR with the changed file + tests.
+    - No integration guidance ("add this somewhere" is vague)
    → Specify exact file and line where code should be added
  Actionability     ██░░░ 2/5
    - Defers work to user ("you'll want to test this")
    - Vague suggestion without specifics
    → Create a PR with the changed file + tests
  Conciseness       ███░░ 3/5
    + Short (120 words)
    - Low information density (~50% hedging/disclaimers)
    → Cut meta-commentary and filler
  OVERALL           2.8/5
 CRITICAL ISSUES (axes ≤ 2):
  [Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry.
  [Actionability] Score 2/5 — No deliverable. Create a PR with test file.
 TOP IMPROVEMENTS:
-  1. Switch to httpx.Retry — grep the codebase first to confirm the HTTP library
+  1. [Accuracy] Switch to httpx.Retry — grep the codebase first
-  2. Create a PR with src/api_client.py + tests/test_api_client.py
+  2. [Actionability] Create a PR with src/api_client.py + tests
-  3. Handle 429, connection errors, and timeout — not just 5xx
+  3. [Completeness] Handle 429, connection errors, and timeout
-VERDICT: Redo with httpx.Retry, full HTTP method coverage, and a test file.
+VERDICT: Redo with specific fixes. Weakest axis: Accuracy (2/5).
-  Do not deliver until accuracy ≥ 4.
+```
 ```
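The score bars in the reports above (`█████ 5/5`, `████░ 4/5`, `██░░░ 2/5`) follow a simple pattern: one filled block per point, padded with light blocks up to five. A minimal sketch of that rendering, assuming hypothetical helper names `render_bar` and `render_axis_line` and an assumed name column width of 17 (the actual helpers and padding in `scripts/evaluate.py` are not shown in this diff):

```python
def render_bar(score: int, max_score: int = 5) -> str:
    """Return a fixed-width bar, e.g. four filled and one empty block for 4 of 5."""
    if not 0 <= score <= max_score:
        raise ValueError(f"score must be between 0 and {max_score}, got {score}")
    return "█" * score + "░" * (max_score - score)


def render_axis_line(name: str, score: int, name_width: int = 17) -> str:
    """Format one axis row: two-space indent, padded name, bar, then 'N/5'.

    name_width is an assumption; the padding used by evaluate.py may differ.
    """
    return f"  {name:<{name_width}}{render_bar(score)} {score}/5"


if __name__ == "__main__":
    for axis, score in [("Accuracy", 5), ("Completeness", 4), ("Actionability", 2)]:
        print(render_axis_line(axis, score))
```

Keeping the bar a fixed five characters wide means the `N/5` column stays aligned regardless of score, which is what makes the reports easy to compare by eye.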
--- a/skills/agent-self-evaluation/scripts/evaluate.py
+++ b/skills/agent-self-evaluation/scripts/evaluate.py
@@ -306,14 +306,40 @@ def format_report(scores: list[AxisScore]) -> str:
    lines.append(f"  {'OVERALL':<15} {avg:.1f}/5")
    lines.append("")
-    # Top improvements
+    # Critical issues (axes ≤ 2)
-    improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4]
+    critical = [(s, s.improvement or "No improvement suggested") for s in scores if s.score <= 2]
-    if improvements:
+    lines.append("CRITICAL ISSUES (axes ≤ 2):")
-        lines.append("TOP IMPROVEMENTS (axes scoring < 4):")
+    if critical:
-        for s, imp in sorted(improvements, key=lambda x: x[0].score):
+        for s, imp in critical:
-            lines.append(f"  [{s.name}] {imp}")
+            lines.append(f"  [{s.name}] Score {s.score}/5 — {imp}")
    else:
-        lines.append("No axes below 4. Strong output across all dimensions.")
+        lines.append("  None")
    lines.append("")
    # Top improvements (axes scoring < 4, ranked by impact)
    improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4]
    lines.append("TOP IMPROVEMENTS:")
    if improvements:
        for i, (s, imp) in enumerate(sorted(improvements, key=lambda x: x[0].score), 1):
            lines.append(f"  {i}. [{s.name}] {imp}")
    else:
        lines.append("  No axes below 4. Strong output across all dimensions.")
    lines.append("")
    # Verdict
    min_score = min(s.score for s in scores)
    if min_score <= 2:
        verdict = f"Redo with specific fixes. Weakest axis: {min(scores, key=lambda s: s.score).name} ({min_score}/5)."
    elif any(s.score <= 3 for s in scores):
        weak = [s.name for s in scores if s.score <= 3]
        verdict = f"Fix {'/'.join(weak)} issues, then deliver."
    elif avg >= 4.5:
        verdict = "Deliver as-is. No changes needed."
    else:
        verdict = "Deliver as-is. Minor improvements noted above."
    lines.append(f"VERDICT: {verdict}")
    return "\n".join(lines)
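The critical-issue filter and the verdict branches in this hunk can be exercised in isolation. The sketch below is a standalone approximation, not the file's actual code: the `AxisScore` dataclass is assumed to carry only the three fields the hunk reads (`name`, `score`, `improvement`), `avg` is assumed to be the plain mean of the axis scores, and the function names `critical_issues` and `verdict` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AxisScore:
    # Assumed shape: only the fields this hunk reads.
    name: str
    score: int
    improvement: str = ""


def critical_issues(scores: list[AxisScore]) -> list[tuple[str, int, str]]:
    """Return (axis, score, fix) for every axis scoring 2 or below."""
    return [
        (s.name, s.score, s.improvement or "No improvement suggested")
        for s in scores
        if s.score <= 2
    ]


def verdict(scores: list[AxisScore]) -> str:
    """Mirror the four verdict branches: redo, fix-then-deliver, two deliver cases."""
    avg = sum(s.score for s in scores) / len(scores)  # assumption: plain mean
    weakest = min(scores, key=lambda s: s.score)  # first axis wins on ties
    if weakest.score <= 2:
        return f"Redo with specific fixes. Weakest axis: {weakest.name} ({weakest.score}/5)."
    weak = [s.name for s in scores if s.score <= 3]
    if weak:
        return f"Fix {'/'.join(weak)} issues, then deliver."
    if avg >= 4.5:
        return "Deliver as-is. No changes needed."
    return "Deliver as-is. Minor improvements noted above."


if __name__ == "__main__":
    weak_output = [
        AxisScore("Accuracy", 2, "Cite specific tool outputs"),
        AxisScore("Completeness", 3),
        AxisScore("Clarity", 4),
        AxisScore("Actionability", 2, "Create a PR with the changed file + tests"),
        AxisScore("Conciseness", 3),
    ]
    print(critical_issues(weak_output))
    print(verdict(weak_output))
```

Under these assumptions, the weak example's scores (2, 3, 4, 2, 3) take the first branch and name Accuracy, since `min` returns the first of the tied axes. The strong example's scores (5, 4, 5, 5, 4) average 4.6, so this logic would select "Deliver as-is. No changes needed."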
--- a/skills/agent-self-evaluation/templates/evaluation-report.md
+++ b/skills/agent-self-evaluation/templates/evaluation-report.md
@@ -1,6 +1,6 @@
 # Agent Self-Evaluation Report Template
-Copy this template and fill in after completing a task. Keep it in your conversation — the user sees it inline.
+Copy this template and fill in after completing a task. The format matches `scripts/evaluate.py` output.
 ```
 ============================================================
@@ -10,27 +10,40 @@ AGENT SELF-EVALUATION REPORT
  Accuracy         █████ 5/5    or    ███░░ 3/5
    + [Evidence: passing tests, verified claims]
    - [Gaps: unverified claims, hedging language]
    → [Improvement if score < 5]
  Completeness      █████ 5/5
    + [What's covered: all requirements + edge cases]
    - [What's missing: explicitly acknowledge gaps]
    → [Improvement if score < 5]
  Clarity           █████ 5/5
    + [Structure: headings, code blocks, bullet points]
    - [Issues: undefined terms, wall of text, no summary]
    → [Improvement if score < 5]
  Actionability     █████ 5/5
    + [User can: merge PR, run command, review file]
    - [Blockers: missing steps, vague suggestions]
    → [Improvement if score < 5]
  Conciseness       █████ 5/5
    + [Tight: no repetition, high information density]
    - [Bloat: filler, meta-commentary, repeated points]
    → [Improvement if score < 5]
  OVERALL           X.X/5
 CRITICAL ISSUES (axes ≤ 2):
  [Axis] Score N/5 — specific fix needed
  (or "None" if no axis ≤ 2)
 TOP IMPROVEMENTS:
-  [Only list axes scoring < 4, ranked by user impact]
+  1. [Highest impact fix]
  2. [Second highest]
  (Only list axes scoring < 4, ranked by user impact)
 VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
 ```
 ## Quick Reference: Scoring Triggers
@@ -64,7 +77,7 @@ Skip the evaluation if:
 | Overall Score | What to do |
 |---|---|
-| ≥4.5 | Deliver. No changes needed. |
+| ≥4.5 | Deliver as-is. No changes needed. |
-| 3.5–4.4 | Flag the top improvement but deliver. Fix if <30 seconds. |
+| 3.5–4.4 | Flag top improvement but deliver. Fix if <30 seconds. |
 | 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" |
 | <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. |
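The score bands in the table above can be read as a simple threshold lookup. A minimal sketch, assuming a hypothetical function `action_for_overall` with illustrative return labels, and assuming the overall score is rounded to one decimal first (the table leaves values such as 4.45 unassigned, so the rounding step is an interpretation rather than something the template states):

```python
def action_for_overall(overall: float) -> str:
    """Map an overall score (0-5) to the action band from the template table.

    The labels are illustrative; the template describes the actions in prose.
    """
    if not 0 <= overall <= 5:
        raise ValueError(f"overall must be between 0 and 5, got {overall}")
    score = round(overall, 1)  # assumption: bands apply to the one-decimal value
    if score >= 4.5:
        return "deliver"           # Deliver as-is. No changes needed.
    if score >= 3.5:
        return "flag_and_deliver"  # Flag top improvement but deliver.
    if score >= 2.5:
        return "ask_user"          # State what you'd change and ask the user.
    return "redo"                  # Don't deliver; redo with a specific fix.


if __name__ == "__main__":
    for value in (4.6, 4.0, 2.8, 2.4):
        print(value, action_for_overall(value))
```

This table keys on the overall score only, while the verdict logic in `format_report()` also looks at the weakest axis, so the two can disagree: an overall of 2.8 lands in the "ask user" band here even when a 2/5 axis yields a "Redo" verdict.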