feat(skills,agents): add agent-self-evaluation skill and agent-evaluator persona

Add structured 5-axis self-evaluation framework for agent output quality:

- Accuracy, Completeness, Clarity, Actionability, Conciseness
- Evidence-based scoring with concrete improvement suggestions
- Standalone Python evaluator script with keyword heuristics
- Detailed scoring anchors reference guide
- High-score and low-score annotated examples
- Reusable evaluation report template
- Optional hook integration for session-stop evaluation

Agent persona (agent-evaluator) provides a dedicated subagent for applying the rubric to agent output with tool-backed verification.

All files tested: Python script runs, examples score correctly (high 4.2, low 3.4), frontmatter parses clean, 183 lines (under 500).
2026-06-16 16:36:53 +08:00 · 2026-06-10 16:56:18 +05:30 · 2026-06-10 16:56:18 +05:30 · bd45947941
commit bd45947941
parent c888d2b73f
8 changed files with 1078 additions and 0 deletions
--- a/agents/agent-evaluator.md
+++ b/agents/agent-evaluator.md
@ -0,0 +1,152 @@
 ---
 name: agent-evaluator
 description: Evaluates agent output against a 5-axis quality rubric (accuracy, completeness, clarity, actionability, conciseness). Use after any non-trivial task when the user wants a quality assessment, or when the agent-self-evaluation skill is active. Produces a structured scorecard with evidence and improvement suggestions.
 tools: ["Read", "Grep", "Glob", "Bash"]
 model: sonnet
 ---
 You are a quality evaluator for AI agent output. Your job is to assess agent responses against structured criteria, not to perform the original task.
 ## Your Role
 - Score agent output on 5 axes: Accuracy, Completeness, Clarity, Actionability, Conciseness
 - Every score below 5 MUST cite specific evidence from the output
 - Provide concrete, actionable improvement suggestions
 - Maintain objectivity — evaluate the output, not the agent's effort or intent
 - Load the `agent-self-evaluation` skill for the detailed scoring rubric
 - DO NOT re-perform the original task
 - DO NOT suggest alternative approaches unless the current approach is factually wrong
 - DO NOT assign score 5 without citing evidence of correctness
 - DO NOT penalize for missing features the user didn't request
 ## Workflow
 ### Step 1: Understand the Task
 Read the user's original request and the agent's final output. Identify:
 - What was explicitly asked for
 - What was implicitly expected (standard practices, edge cases)
 - What the agent claimed to deliver
 ### Step 2: Gather Evidence
 Use tools to verify claims:
 - Run `grep` to confirm API names, function signatures, file paths
 - Check test output for pass/fail status
 - Verify that files the agent claims to have created actually exist
 - Cross-reference claims against project conventions (check existing files for patterns)
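The file-existence check above can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the agent definition: the helper name `missing_files` and the idea of passing claimed paths as a list are hypothetical.

```python
from pathlib import Path


def missing_files(claimed_paths, root="."):
    """Return the claimed paths that do not exist as files under root.

    Hypothetical helper: claimed_paths is the list of files the agent
    says it created or changed, relative to the project root.
    """
    base = Path(root)
    return [path for path in claimed_paths if not (base / path).is_file()]
```

An empty result supports the agent's claim; any returned path is concrete evidence for an Accuracy or Actionability deduction.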
 ### Step 3: Score Each Axis
 Work through the 5 axes from the `agent-self-evaluation` skill:
 1. **Accuracy** — Are claims correct? Grep the codebase to verify.
 2. **Completeness** — All requirements covered? List what's there and what's missing.
 3. **Clarity** — Well-structured? Check for headings, code blocks, summaries.
 4. **Actionability** — Can the user act immediately? Is there a PR, a command, a file?
 5. **Conciseness** — No fluff? Check for redundancy, filler, meta-commentary.
 For each axis:
 - Assign score 1-5
 - If score < 5, cite the specific gap with evidence (line numbers, grep output, file existence)
 - Write a one-sentence improvement
 ### Step 4: Produce Report
 Use this format:
 ```
 ============================================================
 AGENT EVALUATION REPORT
 ============================================================
  Axis            Score   Evidence
  Accuracy         X/5    [What was verified, what was wrong]
  Completeness     X/5    [What's covered, what's missing]
  Clarity          X/5    [Structure quality, readability]
  Actionability    X/5    [Can user act now? What's the next step?]
  Conciseness      X/5    [Information density, redundancy]
  OVERALL          X.X/5
 CRITICAL ISSUES (axes ≤ 2):
  [If any axis scored 2 or below, list it here with the specific fix needed]
 TOP IMPROVEMENTS:
  1. [Highest impact fix first]
  2. [Second highest]
  3. [Third highest]
 VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
 ```
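The scorecard rows and the critical-issues rule (axes scoring 2 or below) from the format above could be produced along these lines. This is a sketch only: `format_rows` and `critical_issues` are hypothetical names, and the column width is chosen to match the template's spacing.

```python
def critical_issues(scores):
    """Return the axes that scored 2 or below, in the order given."""
    return [axis for axis, score in scores.items() if score <= 2]


def format_rows(scores):
    """Format one ' Axis   X/5' row per axis, aligned like the template."""
    return [f" {axis:<16} {score}/5" for axis, score in scores.items()]


# Scores taken from the weak-output example in this file.
scores = {
    "Accuracy": 2,
    "Completeness": 3,
    "Clarity": 3,
    "Actionability": 2,
    "Conciseness": 3,
}
print("\n".join(format_rows(scores)))
print("CRITICAL ISSUES:", ", ".join(critical_issues(scores)))
```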
 ## Output Format
 Always include the structured report above. Its final line is the one-line verdict: "Deliver as-is", "Fix [specific issue] then deliver", or "Redo with [specific approach]".
 ## Examples
 ### Example: Strong Output
 Task: Add retry logic to HTTP client. 3 retries, exponential backoff.
 ```
 AGENT EVALUATION REPORT
  Accuracy         5/5    grep confirms httpx.Retry used correctly.
                          Tests pass (42/42). Import verified.
  Completeness      4/5    All HTTP methods covered. Missing: connection
                          pool exhaustion handling (minor edge case).
  Clarity           5/5    Well-structured. Summary, code blocks, bullet
                          points. 10-second scan tells the full story.
  Actionability     5/5    Single PR (#423). `pytest -v` cited. Merge is
                          the only action needed.
  Conciseness       4/5    250 words. Verification section slightly
                          verbose — 3 commands could be 1 script.
  OVERALL          4.6/5
 TOP IMPROVEMENTS:
  1. Add connection pool exhaustion to edge cases doc
  2. Consolidate verification commands into a single script
 VERDICT: Deliver as-is. The one gap (pool exhaustion) is a P2 edge case.
 ```
 ### Example: Weak Output
 Task: Same as above.
 ```
 AGENT EVALUATION REPORT
  Accuracy         2/5    CRITICAL: Agent used urllib3.Retry but project
                          uses httpx. grep proves no urllib3 import exists.
                          Hedging language: "I think", "probably fine".
  Completeness      3/5    Only handles 5xx. Missing: 429 rate limiting,
                          connection timeouts. Agent acknowledges gaps
                          ("might be edge cases") but doesn't fix them.
  Clarity           3/5    Code is readable but no explanation of where
                          to integrate. "Add this somewhere" is vague.
  Actionability     2/5    No PR, no file created, no test written.
                          User has to: figure out placement, fix library,
                          write tests, handle idempotency.
  Conciseness       3/5    120 words but ~50% is hedging/disclaimers.
                          Low information density.
  OVERALL          2.6/5
 CRITICAL ISSUES:
  Accuracy: Wrong library. Use httpx.Retry, not urllib3.Retry.
  Actionability: No deliverable. Create a PR with the changed file + tests.
 TOP IMPROVEMENTS:
  1. Switch to httpx.Retry — grep the codebase first to confirm the HTTP library
  2. Create a PR with src/api_client.py + tests/test_api_client.py
  3. Handle 429, connection errors, and timeout — not just 5xx
 VERDICT: Redo with httpx.Retry, full HTTP method coverage, and a test file.
  Do not deliver until accuracy ≥ 4.
 ```
--- a/skills/agent-self-evaluation/SKILL.md
+++ b/skills/agent-self-evaluation/SKILL.md
@ -0,0 +1,182 @@
 ---
 name: agent-self-evaluation
 description: Use after completing any non-trivial task. The agent self-rates its output on 5 axes — accuracy, completeness, clarity, actionability, conciseness — with concrete evidence per criterion. Produces a structured 1-5 scorecard with specific improvement suggestions.
 origin: ECC
 ---
 # Agent Self-Evaluation
 After completing a complex task, the agent pauses to rate its own output against a structured 5-axis rubric. This is NOT a pass/fail gate; it's a deliberate reflection step that catches omissions, flags overconfidence, and surfaces areas for improvement before the user has to.
 ## When to Activate
 - After writing code that spans 3+ files or 50+ lines
 - After completing a multi-step workflow (implement → test → review)
 - After a debugging session that involved 3+ attempts
 - After producing a design document, architecture decision, or written analysis
 - When the user asks "how good was that?" or "rate yourself"
 - At the end of a session, via the Stop hook (if configured; see `references/hook-integration.md`)
 ## Core Concepts
 ### The 5 Evaluation Axes
 | Axis | Question | What it catches |
 |---|---|---|
 | **Accuracy** | Are the facts, claims, and outputs correct? | Hallucinations, wrong API names, incorrect syntax, false statements |
 | **Completeness** | Did it cover everything the user asked for? | Missed edge cases, unhandled error paths, forgotten requirements, skipped subtasks |
 | **Clarity** | Is the explanation understandable and well-structured? | Confusing explanations, jargon without definition, missing context, rambling |
 | **Actionability** | Can the user act on the output immediately? | Vague suggestions, missing steps, "you should X" without showing how, no verification path |
 | **Conciseness** | Did it use the minimum words/tokens needed? | Redundancy, over-explanation, repeating the user's question verbatim, filler content |
 ### Scoring Scale
 ```
 5 — Exceptional: no reasonable improvement possible
 4 — Good: minor nits only, no substantive gaps
 3 — Adequate: meets the request but has a notable weakness
 2 — Weak: has a clear gap that affects usability or correctness
 1 — Poor: fundamentally misses the request or contains significant errors
 ```
 ### The Evidence Rule
 Every score below 5 MUST cite specific evidence. A score of 3 cannot just say "could be better" — it must say exactly what is missing or wrong. The mantra: **"Show the gap, don't just name it."**
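The Evidence Rule lends itself to a mechanical check. The sketch below assumes a scorecard represented as a dict mapping each axis to a `(score, evidence)` pair; that representation and the name `missing_evidence` are illustrative, not something the skill defines.

```python
def missing_evidence(scorecard):
    """Return axes that score below 5 but cite no evidence.

    scorecard maps axis name -> (score, evidence_text). Whitespace-only
    evidence counts as missing.
    """
    return [
        axis
        for axis, (score, evidence) in scorecard.items()
        if score < 5 and not evidence.strip()
    ]
```

This only detects absent evidence. Whether the cited evidence actually shows the gap (rather than just naming it) still needs a human or LLM judgment.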
 ## Workflow
 ### Step 1: Collect the Raw Material
 Gather what you'll evaluate:
 ```
 - The original user request (read back from conversation)
 - Your final response/output (the deliverable)
 - Any tool outputs that verify correctness (test results, exit codes, lint output)
 - Any user feedback received during the task (corrections, "try again", "that's not right")
 ```
 ### Step 2: Score Each Axis Independently
 Work through the 5 axes one at a time. For each:
 1. Read the axis question
 2. Find evidence (or lack of evidence) in the output
 3. Assign a score 1-5
 4. If score < 5, write a one-sentence improvement note citing the gap
 Do NOT average the scores in your head first and then work backwards. Score each axis fresh.
 ### Step 3: Produce the Evaluation Report
 Use the template from `templates/evaluation-report.md`. The report must include:
 ```
 - One-line summary
 - 5-axis scorecard (score + evidence per axis)
 - Overall score (simple average, rounded to 1 decimal)
 - 1-3 specific improvements ranked by impact
 - Self-check: "Would the user agree with this assessment?"
 ```
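The overall score is a plain arithmetic mean of the five axis scores. A minimal sketch, using the scores from the two scorecards in this skill's examples (the function name `overall_score` is an assumption):

```python
def overall_score(scores):
    """Simple average of the axis scores, rounded to 1 decimal."""
    return round(sum(scores) / len(scores), 1)


print(overall_score([5, 4, 5, 5, 4]))  # 4.6
print(overall_score([2, 3, 4, 2, 3]))  # 2.8
```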
 ### Step 4: Apply the Improvement
 If any axis scored 3 or below:
 1. State what you would do differently
 2. If the gap is fixable in < 30 seconds (missing link, unclear phrasing), fix it now
 3. If the gap requires rework, flag it explicitly: "This axis scored __ because __. Re-running with __ would likely raise it to __."
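The rework sentence in step 3 can be filled in programmatically for every axis at or below the threshold. This is a sketch under assumptions: the tuple layout `(score, gap, fix, expected_score)` and the name `rework_flags` are hypothetical.

```python
def rework_flags(scorecard, threshold=3):
    """Build the rework sentence for each axis scoring at or below threshold.

    scorecard maps axis name -> (score, gap, fix, expected_score).
    """
    flags = []
    for axis, (score, gap, fix, expected) in scorecard.items():
        if score <= threshold:
            flags.append(
                f"{axis} scored {score} because {gap}. "
                f"Re-running with {fix} would likely raise it to {expected}."
            )
    return flags
```

Axes scoring 4 or 5 produce no flag, which matches the rule that only scores of 3 or below trigger this step.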
 ## Code Examples
 ### Example: Good Evaluation (Score 4+)
 ```
 Task: Add retry logic to HTTP client
 Scorecard:
  Accuracy:    5 — All API calls correct. Verified: retries use
                  exponential backoff. No hallucinated methods.
  Completeness: 4 — Covered happy path + 3 error cases. Missing:
                  timeout handling for hung connections.
  Clarity:      5 — Code comments explain backoff formula.
                  PR description links to incident that motivated this.
  Actionability:5 — Single merge. No follow-up tasks. Tests pass.
  Conciseness:  4 — 47 lines total. The retry loop could be extracted
                  into a helper to drop ~8 lines.
 Overall: 4.6 — One gap (timeout handling). Fix before merging.
 ```
 ### Example: Weak Evaluation (Score 2-3)
 ```
 Task: Add retry logic to HTTP client
 Scorecard:
  Accuracy:    2 — Used urllib3.Retry which doesn't exist in our
                  httpx-based codebase. Wrong library.
  Completeness: 3 — Works for GET. POST/PUT not handled (user
                  said "all HTTP requests").
  Clarity:      4 — Code is readable. Good variable names.
  Actionability:2 — "Add tests" mentioned but no test file created.
                  User has to write tests before merging.
  Conciseness:  3 — 120 lines. The retry config is duplicated in
                  3 places instead of one shared RetryConfig object.
 Overall: 2.8 — Wrong library used. Needs httpx rewrite.
  Fix accuracy first (switch to httpx.Retry), then extend to all
  HTTP methods, then consolidate config.
 ```
 ## Anti-Patterns
 ### "Everything is a 5"
 ```
 ❌ Accuracy:    5 — All good.
   Completeness: 5 — Everything covered.
   Clarity:      5 — Clear.
 ```
 No evidence cited. This is self-congratulation, not evaluation. A real 5 requires proving there's nothing to improve.
 ### Over-penalizing for scope creep
 ```
 ❌ Completeness: 2 — Didn't handle WebSocket connections or
   gRPC streaming (user didn't ask for these)
 ```
 Only evaluate against what the user actually requested, not what you could have additionally built.
 ### Using the evaluation to re-litigate
 ```
 ❌ "As I said earlier, this approach is wrong. Score: 1"
 ```
 The evaluation is about the delivered output, not about re-arguing design decisions that were already made. If the approach was wrong, that should have been caught before delivery.
 ### Mixing personal preference with objective gaps
 ```
 ❌ "Score: 3. I don't like Python decorators."
 ```
 "Don't like" is not evidence. Cite a concrete readability, testability, or correctness concern, or leave the score at 4+.
 ## Best Practices
 - **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took.
 - **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap.
 - **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling."
 - **Be specific about what 'fixed' looks like.** "Re-run with `httpx.Retry(total=3, backoff_factor=1.0)`" beats "fix the library issue."
 - **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof.
 - **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?"
 ## Related Skills
 - `agent-eval` — Head-to-head comparison of different coding agents on benchmark tasks
 - `verification-loop` — Systematic verification of outputs against expected results
 - `security-review` — Security-focused code review checklist
--- a/skills/agent-self-evaluation/examples/high-score-example.md
+++ b/skills/agent-self-evaluation/examples/high-score-example.md
@ -0,0 +1,87 @@
 # High-Score Example: Adding Retry Logic
 ## Task
 > Add retry logic with exponential backoff to all HTTP requests in the API client. Max 3 retries, initial delay 1s, max delay 30s.
 ## Agent Output
 ### Summary
 Added `httpx.Retry` transport to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
 ### Files Changed
 - `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.Retry(total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504])`
 - `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429)
 ### Verification
 ```
 $ pytest tests/test_api_client.py -v
 42 passed in 2.34s
 $ grep -r "RetryTransport" src/
 src/api_client.py:23: class RetryTransport(httpx.BaseTransport):
 src/api_client.py:45:     self._transport = RetryTransport(...)
 $ python -c "from src.api_client import ApiClient; c = ApiClient(); print(type(c._transport))"
 <class 'src.api_client.RetryTransport'>
 ```
 ### Edge Cases Handled
 - Connection timeouts (retry with backoff)
 - Rate limiting (429 → retry after delay)
 - Server errors (5xx → retry)
 - Client errors except 429 (no retry — 4xx means the request is bad)
 - Idempotency: GET, PUT, DELETE retry; POST does not (configurable)
 ### Merge: [PR #423](https://github.com/example/pr/423)
 ---
 ## Evaluation Report
 ```
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
  Accuracy         █████ 5/5
    + Tests passing
     + Explicit verification (grep confirmed class exists, import test passed)
    → All claims backed by tool output. No hedging.
  Completeness      ████░ 4/5
    + Edge cases addressed (5 specific scenarios listed)
    + Error handling present across all HTTP methods
    → Missing: connection pool exhaustion handling (what happens when all
      connections are in retry state?) — minor gap, not blocking.
  Clarity           █████ 5/5
    + Uses headings for structure
    + Uses code blocks
    + Uses bullet points
    + Summary in first 3 lines
    → Well-organized. Reader can scan in 10 seconds.
  Actionability     █████ 5/5
    + PR created and linked
    + Specific run command given (pytest)
    + Verification steps included
    → Single action: merge PR #423. Everything else is done.
  Conciseness       ████░ 4/5
    + No redundancy detected
    → The verification section could be slightly tighter (3 commands
      could be 1 with a verification script). Minor.
  OVERALL           4.6/5
 TOP IMPROVEMENTS:
  No axes below 4. Strong output across all dimensions.
 ```
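The score bars in the report above are five block characters, filled up to the score. A small sketch of how such a bar could be rendered (the function name `score_bar` is illustrative; the evaluator script may build its bars differently):

```python
def score_bar(score, width=5):
    """Render a score as filled and empty blocks, e.g. 4 -> '████░'."""
    if not 0 <= score <= width:
        raise ValueError(f"score must be between 0 and {width}, got {score}")
    return "█" * score + "░" * (width - score)


print(f" Completeness      {score_bar(4)} 4/5")
```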
 ### Why This Scores Well
 1. **Accuracy pinned to tool output.** Every claim ("tests pass", "class exists", "import works") has a corresponding terminal output line. No "should work" or "probably fine."
 2. **Completeness is explicit about what's covered AND what's not.** The edge cases section lists both handled and intentionally-unhandled cases (POST idempotency).
 3. **Actionability is single-step.** The user only needs to merge one PR. No follow-up tasks, no "then configure X."
 4. **Conciseness is tight.** The output is ~250 words. The information density is high; every sentence carries weight.
--- a/skills/agent-self-evaluation/examples/low-score-example.md
+++ b/skills/agent-self-evaluation/examples/low-score-example.md
@ -0,0 +1,86 @@
 # Low-Score Example: Adding Retry Logic
 ## Task
 > Add retry logic with exponential backoff to all HTTP requests in the API client. Max 3 retries, initial delay 1s, max delay 30s.
 ## Agent Output
 Here's a retry implementation. I think this should work for most cases.
 I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically so we just need to configure it.
 ```python
 from urllib3.util import Retry
 # Add this somewhere in your HTTP client
 retry = Retry(
    total=3,
    backoff_factor=1.0,
    status_forcelist=[500, 502, 503, 504]
 )
 ```
 You'll want to test this before deploying. There might be edge cases with POST requests but those are probably fine. I haven't tested the timeout behavior but it should work.
 Let me know if you need anything else!
 ---
 ## Evaluation Report
 ```
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
  Accuracy         ██░░░ 2/5
    - Hedged claim without verification ("I think this should work")
    - Untested ("I haven't tested the timeout behavior")
    - Speculation without evidence ("those are probably fine")
    → Wrong library used. Project uses httpx, not urllib3.
      urllib3.util.Retry is incompatible with httpx transport.
  Completeness      ███░░ 3/5
    - Explicit gap acknowledged ("might be edge cases with POST")
    - Limited scope noted (only mentioned 5xx, not 429 or connection errors)
    → User asked for "all HTTP requests." Only partial coverage:
      missing 429 handling, connection errors, timeout handling.
  Clarity           ████░ 4/5
    + Uses code blocks
    → Code is readable but no explanation of where to add it
      ("somewhere in your HTTP client" is vague).
  Actionability     ██░░░ 2/5
    - Defers work to user ("you'll want to test this")
    - Vague suggestion without specifics
    → No PR, no file created, no test written. User has to:
      1. Figure out where to add the code
      2. Fix the library mismatch (httpx not urllib3)
      3. Write tests
      4. Handle POST idempotency
      5. Test timeout behavior
  Conciseness       ███░░ 3/5
    - Meta-commentary adds words without information
      ("Let me know if you need anything else!")
    → 120 words. Low word count but low information density.
      Half the text is hedging and disclaimers, not substance.
  OVERALL           2.8/5
 TOP IMPROVEMENTS (axes scoring < 4):
  [Accuracy] Switch to httpx.Retry — grep the codebase to confirm the HTTP
    library before writing code.
  [Actionability] Create a PR with the changed file + test file. Run the
    tests. End with "PR #N ready to merge."
  [Completeness] List what's covered AND what's not. If POST retry is
    unsafe, say so explicitly with reasoning.
 ```
 ### Why This Scores Poorly
 1. **Accuracy fails at the most basic level:** wrong library. One `grep -r httpx src/` would have caught this. The hedging language ("I think", "probably", "should work") signals the agent knows it's guessing.
 2. **Not actionable.** The user received a code snippet and a list of things they need to do. The agent did the easy part (suggesting a library) and deferred the hard parts (testing, integration, edge cases) to the user.
 3. **Completeness gaps are acknowledged but not fixed.** "Might be edge cases" is worse than not mentioning them — it shows awareness of the gap and a choice not to address it.
 4. **Information density is low.** 120 words, of which ~60 are hedging/disclaimers/politeness. The actual substance (3 lines of code) could have been delivered in 40 words with verification.
--- a/skills/agent-self-evaluation/references/evaluation-criteria.md
+++ b/skills/agent-self-evaluation/references/evaluation-criteria.md
@ -0,0 +1,71 @@
 # Evaluation Criteria — Detailed Scoring Guide
 This reference provides concrete scoring anchors for each axis. Use it when you're unsure whether a gap merits a 4 vs a 3, or a 2 vs a 1.
 ## Accuracy
 | Score | Anchor | Example |
 |---|---|---|
 | 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Used `httpx.Retry` — confirmed in httpx docs. All method names verified with grep against codebase. |
 | 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (httpx defaults to 1.0s, claimed 0.5s). |
 | 3 | One significant factual error, or 3+ minor inaccuracies. | Used `urllib3.Retry` in an httpx codebase. Works in this one case but wrong library. |
 | 2 | Multiple significant errors. Output would fail if followed. | Claimed "add this to package.json" but project uses pyproject.toml. Two other config claims also wrong. |
 | 1 | Fundamentally incorrect. Output contradicts itself or known facts. | Code has syntax errors. API endpoint doesn't exist. Claims a function signature that grep disproves. |
 ## Completeness
 | Score | Anchor | Example |
 |---|---|---|
 | 5 | All explicit and implicit requirements covered. Edge cases handled. Error paths addressed. | User said "add retry to all HTTP requests." GET, POST, PUT, DELETE all covered. Timeout, 429, 5xx all handled. |
 | 4 | All explicit requirements covered. One implicit requirement missed. | All HTTP methods covered. Forgot to handle connection timeouts (not mentioned but expected). |
 | 3 | One explicit requirement missed, or 2+ implicit gaps. | User said "add logging too." Retry logic added but no logging. |
 | 2 | Multiple explicit requirements missed. Output is a partial solution. | Asked for retry + circuit breaker. Only retry implemented. |
 | 1 | Misses the core request. Delivers something adjacent to what was asked. | Asked for retry logic. Wrote a health check endpoint instead. |
 ## Clarity
 | Score | Anchor | Example |
 |---|---|---|
 | 5 | Perfectly structured. Jargon explained or avoided. Visual hierarchy helps scanning. No ambiguity. | README with clear sections, code blocks, and a 10-second summary at top. |
 | 4 | Generally clear. One section could be better organized or one term undefined. | Good structure but `exponential backoff` used without explanation — assumes the reader knows it. |
 | 3 | Understandable after re-reading. Multiple organizational issues or undefined terms. | The explanation circles the point before getting to it. Several terms used before defined. |
 | 2 | Confusing in places. Reader would need to ask follow-up questions. | Code works but the PR description doesn't explain why retry was needed or what it fixes. |
 | 1 | Unintelligible or contradictory. Reader cannot determine what was done or why. | Output is a wall of text with no structure. Conclusions contradict earlier statements. |
 ## Actionability
 | Score | Anchor | Example |
 |---|---|---|
 | 5 | Single action required. Verification path included. No implicit steps. | "Merge this PR. Tests pass: `42 passed`. Deploy with `./deploy.sh`." |
 | 4 | Single action required but verification path is implied, not explicit. | "Merge this PR." (Tests exist but weren't cited. User has to check themselves.) |
 | 3 | Multiple actions required, or one action with unclear next step. | "Review and merge. Then update the config." (Which config? Where? No link or path.) |
 | 2 | User must figure out how to use the output. Missing critical instructions. | Code written but no test file, no run instructions, no PR created. User has to assemble everything. |
 | 1 | Output cannot be acted on without significant rework or clarification. | "Here's a design idea." (No code, no file, no PR. User has to start from scratch.) |
 ## Conciseness
 | Score | Anchor | Example |
 |---|---|---|
 | 5 | Every sentence earns its place. No redundancy. Information density is high. | 30 lines that say what 60 lines would. No repeated points. No filler. |
 | 4 | Minor redundancy. One paragraph could be tightened. | Good overall but repeats the motivation in both the PR description and code comments. |
 | 3 | Noticeable redundancy. 20%+ of content could be removed without loss. | Explains the same concept three times (in summary, body, and conclusion). Verbose examples. |
 | 2 | Significantly bloated. 40%+ of content is filler or repetition. | 200 lines for a task that needed 60. Restates the user's question. Includes irrelevant background. |
 | 1 | Noise-to-signal ratio is inverted. More filler than substance. | 500-line response to a 2-line question. Most of it is boilerplate, repetition, or irrelevant context. |
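The percentage anchors in the Conciseness table can be turned into a rough mapping from filler fraction to score. This is a sketch, not the rubric itself: the table only gives numbers for scores 3 (20%+) and 2 (40%+), so treating "more filler than substance" as above 50%, any filler below 20% as a 4, and zero filler as a 5 are assumptions made here. How the filler fraction is estimated is left open.

```python
def conciseness_from_filler(filler_fraction):
    """Map an estimated filler fraction (0.0 to 1.0) to a Conciseness score."""
    if not 0.0 <= filler_fraction <= 1.0:
        raise ValueError("filler_fraction must be between 0.0 and 1.0")
    if filler_fraction > 0.5:   # assumption: "more filler than substance"
        return 1
    if filler_fraction >= 0.4:  # table anchor: 40%+ is filler or repetition
        return 2
    if filler_fraction >= 0.2:  # table anchor: 20%+ could be removed
        return 3
    if filler_fraction > 0.0:   # assumption: minor redundancy
        return 4
    return 5                    # assumption: no removable content at all
```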
 ## Edge Cases
 ### When the user gave unclear instructions
 If the user's request was ambiguous, do NOT penalize completeness for not reading minds. Instead, note in the evaluation: "User's request was ambiguous about __. I chose interpretation __. If they meant __, this score would drop to __."
 ### When the task is inherently simple
 A 3-line bug fix can legitimately score 5/5/5/5/5. The rubric scales with complexity — a simple task done perfectly IS a 5.0. Don't invent gaps to justify lower scores.
 ### When you caught your own error mid-task
 If you made an error, caught it, and fixed it before delivering — that's a 5 on Accuracy for the final output. The evaluation is about what the user received, not your internal process. Note the self-correction as evidence of thoroughness, not as a penalty.
 ### When the tool output contradicts your claim
 If you claimed "tests pass" but the terminal output shows a failure — that's an automatic Accuracy ≤ 2. Tool output is ground truth. Claims without verification are the most common source of low accuracy scores.
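The "tool output is ground truth" rule can be expressed as a cap on the Accuracy score. The sketch below assumes a pytest-style summary line such as `3 failed, 39 passed`; the function name `cap_accuracy` and its parameters are hypothetical, and other test runners would need a different pattern.

```python
import re

# Matches a non-zero failure count in a pytest-style summary line.
FAILED_PATTERN = re.compile(r"\b[1-9]\d*\s+failed\b")


def cap_accuracy(score, claims_tests_pass, tool_output):
    """Cap Accuracy at 2 when a 'tests pass' claim contradicts the tool output."""
    shows_failure = FAILED_PATTERN.search(tool_output) is not None
    if claims_tests_pass and shows_failure:
        return min(score, 2)
    return score
```

The cap only lowers a score, so an output already rated 1 stays at 1.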
--- a/skills/agent-self-evaluation/references/hook-integration.md
+++ b/skills/agent-self-evaluation/references/hook-integration.md
@ -0,0 +1,59 @@
 # Hook Integration for Session-Stop Self-Evaluation
 Add this hook to `hooks/hooks.json` to print a self-evaluation reminder at the end of every session. The hook only echoes a message; it does not run the evaluation itself:
 ```json
 {
  "hooks": {
    "Stop": [
      {
        "matcher": "true",
        "hooks": [
          {
            "type": "command",
            "command": "echo '[Self-Eval] Session complete. Consider running agent-self-evaluation to rate your output.'"
          }
        ],
        "description": "Remind agent to self-evaluate at session end"
      }
    ]
  }
 }
 ```
 ## Integration with the Python Evaluator
 The `scripts/evaluate.py` script can be used as a standalone tool:
 ```bash
 # Pipe agent output directly
 echo "Your agent response here" | python3 skills/agent-self-evaluation/scripts/evaluate.py
 # From files
 python3 skills/agent-self-evaluation/scripts/evaluate.py --task task.txt --output response.txt
 ```
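 A hook can also prompt for evaluation after test runs. The example below only prints a reminder; to run the evaluator itself from a hook, first capture the last agent output to a file and pass it with `--output`: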
 ```json
 {
  "PostToolUse": [
    {
      "matcher": "tool == \"Bash\" && tool_input.command matches \"(test|pytest|npm test|go test)\"",
      "hooks": [
        {
          "type": "command",
          "command": "echo '[Self-Eval] Tests completed. Consider running agent-self-evaluation.'"
        }
      ],
      "description": "Remind agent to self-evaluate after test runs"
    }
  ]
 }
 ```
 These hooks are opt-in. Add them to your local `hooks/hooks.json` if you want automated evaluation prompts.
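Adding the Stop reminder to an existing configuration by hand is error-prone if the file already contains other hooks. A minimal sketch of a merge helper follows; it assumes the configuration has already been loaded as a dict shaped like the first JSON example, and the name `add_stop_reminder` is hypothetical.

```python
import copy

# Same entry as the Stop hook shown in the first JSON example.
REMINDER = {
    "matcher": "true",
    "hooks": [
        {
            "type": "command",
            "command": "echo '[Self-Eval] Session complete. Consider running "
                       "agent-self-evaluation to rate your output.'",
        }
    ],
    "description": "Remind agent to self-evaluate at session end",
}


def add_stop_reminder(config):
    """Return a copy of config with the Stop reminder added at most once."""
    updated = copy.deepcopy(config)
    stop_hooks = updated.setdefault("hooks", {}).setdefault("Stop", [])
    already_present = any(
        entry.get("description") == REMINDER["description"] for entry in stop_hooks
    )
    if not already_present:
        stop_hooks.append(copy.deepcopy(REMINDER))
    return updated
```

Load the file with `json.load`, pass the result through the helper, and write it back with `json.dump`. Matching on the description keeps repeated runs from adding duplicate entries.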
 ## Manual Usage (Recommended)
 The most reliable approach is manual invocation — the agent runs self-evaluation as part of its workflow when the `agent-self-evaluation` skill is active, without requiring hook configuration. The skill's "When to Activate" section already covers trigger conditions (multi-file changes, debugging sessions, design documents).
--- a/skills/agent-self-evaluation/scripts/evaluate.py
+++ b/skills/agent-self-evaluation/scripts/evaluate.py
@ -0,0 +1,371 @@
 #!/usr/bin/env python3
 """Standalone agent output evaluator using the 5-axis rubric.
 Reads a task description and agent output from stdin or files,
 scores each axis, and prints a structured evaluation report.
 Usage:
    # Pipe output directly
    echo "Task: Add retry logic" | evaluate.py --output response.txt
    # From files
    evaluate.py --task task.txt --output response.txt
    # Interactive (reads task from prompt, output from stdin)
    evaluate.py --interactive
 The evaluator uses keyword heuristics + structural checks as a first pass.
 For production use, pair with an LLM judge for semantic understanding.
 """
 import argparse
 import re
 import sys
 from dataclasses import dataclass, field
 from typing import Optional
@dataclass
 class AxisScore:
    name: str
    score: int
    evidence: list[str] = field(default_factory=list)
    improvement: Optional[str] = None
 def count_words(text: str) -> int:
    return len(text.split())
 def check_accuracy(text: str) -> AxisScore:
    """Check for verifiable claims, tool output references, error signs."""
    evidence = []
    deductions = 0
    score = 5
    # Positive signals: verified claims
    verified_patterns = [
        (r"(?i)(tests?\s+pass|all\s+tests?\s+passing|\d+\s+passed)", "Tests passing"),
        (r"(?i)(exit\s+code\s*[:=]?\s*0|exited\s+with\s+0)", "Clean exit code"),
         (r"(?i)(lint.*clean|no\s+lint\s+errors|\b0\s+errors)", "Lint clean"),  # \b stops "10 errors" from matching
        (r"(?i)(verified|confirmed|validated)\s+(with|against|using|by)", "Explicit verification"),
        (r"(?i)(grep|rg)\s+.*\b(found|matched|returned)", "Grep confirmed"),
    ]
    for pattern, label in verified_patterns:
        if re.search(pattern, text):
            evidence.append(f"+ {label}")
    # Negative signals: unverified claims
    danger_patterns = [
        (r"(?i)(should\s+work|probably\s+fine|should\s+be\s+ok)", "Hedged claim without verification"),
        (r"(?i)(I\s+think|I\s+believe|I\s+assume|might\s+be)", "Speculation without evidence"),
        (r"(?i)(untested|not\s+tested|haven'?t\s+tested)", "Explicitly untested"),
        # Case-sensitive with word boundaries, so prose such as "workaround" or "mastodon" does not match
        (r"\b(TODO|FIXME|HACK|WORKAROUND)\b", "Unresolved TODO/FIXME"),
    ]
    for pattern, label in danger_patterns:
        if re.search(pattern, text):
            deductions += 1
            evidence.append(f"- {label}")
    if deductions >= 3:
        score = 2
    elif deductions == 2:
        score = 3
    elif deductions == 1:
        score = 4
    if not evidence:
        evidence.append("No verification signals detected — score assumes correctness")
    result = AxisScore(name="Accuracy", score=score, evidence=evidence)
    if score < 5:
        result.improvement = "Cite specific tool outputs (test results, exit codes, grep findings) to back claims"
    return result
 def check_completeness(text: str, task: Optional[str] = None) -> AxisScore:
    """Check for requirement coverage, edge cases, error handling."""
    evidence = []
    score = 5
    # Positive signals
    completeness_signals = [
        (r"(?i)(edge\s*cases?|corner\s*cases?)", "Edge cases addressed"),
        (r"(?i)(error\s*handling|exception\s*handling|try/except|try\s*{)", "Error handling present"),
        (r"(?i)(all\s+\w+\s+(methods|endpoints|routes))", "Full coverage claimed"),
        (r"(?i)(verification|verified\s+that|confirmed\s+that)", "Verification step present"),
    ]
    for pattern, label in completeness_signals:
        if re.search(pattern, text):
            evidence.append(f"+ {label}")
    # Gaps
    gap_signals = [
        (r"(?i)(not\s+covered|not\s+handled|out\s+of\s+scope)", "Explicit gap acknowledged"),
        (r"(?i)(only\s+(works|handles|supports)\s+\w+)", "Limited scope noted"),
        (r"(?i)(assume[sd]?\s+that|assuming\s+the)", "Assumption without verification"),
    ]
    deductions = 0
    for pattern, label in gap_signals:
        if re.search(pattern, text):
            deductions += 1
            evidence.append(f"- {label}")
    if deductions >= 2:
        score = 3
    elif deductions == 1:
        score = 4
    if not evidence:
        evidence.append("No completeness signals — unable to assess coverage")
    result = AxisScore(name="Completeness", score=score, evidence=evidence)
    if score < 5:
        result.improvement = "List what was covered AND what was intentionally excluded, with reasoning"
    return result
 def check_clarity(text: str) -> AxisScore:
    """Check for structure, readability, jargon handling."""
    evidence = []
    score = 5
    deductions = 0
    # Positive signals
    if re.search(r"^#{1,3}\s+", text, re.MULTILINE):
        evidence.append("+ Uses headings for structure")
    if re.search(r"```", text):
        evidence.append("+ Uses code blocks")
    if re.search(r"^\s*[-*]\s+", text, re.MULTILINE):
        evidence.append("+ Uses bullet points")
    # Negative signals
    # Wall of text: long paragraph without breaks
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for p in paragraphs:
        if count_words(p) > 200:
            deductions += 1
            evidence.append("- Wall-of-text paragraph (>200 words without break)")
            break
    # Jargon without definition
    jargon = [
        (r"\b(idempotent|race condition|deadlock|thundering herd)\b", "concurrency"),
        (r"\b(exponential backoff|circuit breaker|bulkhead)\b", "resilience"),
        (r"\b(ACID|CAP|eventual consistency|linearizability)\b", "database theory"),
    ]
    for pattern, domain in jargon:
        if re.search(pattern, text, re.IGNORECASE):
            if not re.search(rf"(?i)({domain}|means|refers to|i\.e\.|in other words)", text):
                deductions += 1
                evidence.append(f"- Domain term used without explanation ({domain})")
                break
    # Look for an early summary in the first 100 words (not the first 100 characters)
    opening = " ".join(text.split()[:100]).lower()
    if not any(t in opening for t in ["summary", "tldr", "tl;dr", "overview", "in short"]):
        # No early summary: penalize only if text is long
        if count_words(text) > 300:
            deductions += 1
            evidence.append("- No summary/TLDR in first 100 words (text is 300+ words)")
    if deductions >= 3:
        score = 2
    elif deductions == 2:
        score = 3
    elif deductions == 1:
        score = 4
    if not evidence:
        evidence.append("+ Well-structured with no clarity issues detected")
    result = AxisScore(name="Clarity", score=score, evidence=evidence)
    if score < 5:
        result.improvement = "Add headings, break long paragraphs, define domain terms on first use"
    return result
 def check_actionability(text: str) -> AxisScore:
    """Check if the user can act on the output immediately."""
    evidence = []
    score = 5
    deductions = 0
    # Positive signals
    actionable_signals = [
        (r"(?i)(merge|PR|pull request).*?(created|ready|open)", "PR created"),
        (r"(?i)(run|execute)\s+[`\"']?[\w./-]+", "Specific run command given"),
        (r"(?i)(next\s+steps?|follow[- ]up|what\s+to\s+do)", "Next steps provided"),
        (r"(?i)(file\s+(created|written|modified|updated)\s+at)", "File path specified"),
    ]
    for pattern, label in actionable_signals:
        if re.search(pattern, text):
            evidence.append(f"+ {label}")
    # Negative signals
    vague_signals = [
        (r"(?i)(you\s+(should|could|might\s+want\s+to))\s+\w+", "Vague suggestion without specifics"),
        (r"(?i)(consider|maybe|perhaps)\s+\w+ing", "Non-committal suggestion"),
        (r"(?i)(figure\s+out|look\s+into|investigate)\s", "Defers work to user"),
    ]
    for pattern, label in vague_signals:
        if re.search(pattern, text):
            deductions += 1
            evidence.append(f"- {label}")
    if deductions >= 3:
        score = 2
    elif deductions == 2:
        score = 3
    elif deductions == 1:
        score = 4
    if not evidence:
        evidence.append("No actionability signals — user may need to ask 'what now?'")
    result = AxisScore(name="Actionability", score=score, evidence=evidence)
    if score < 5:
        result.improvement = "End with a single clear action: 'Merge this PR', 'Run ./deploy.sh', or 'Review the 3 changed files'"
    return result
 def check_concision(text: str, task: Optional[str] = None) -> AxisScore:
    """Check for redundancy, filler, information density."""
    evidence = []
    score = 5
    wc = count_words(text)
    # Heuristic: task-to-output ratio
    if task:
        task_wc = count_words(task)
        ratio = wc / max(task_wc, 1)
        if ratio > 15:
            evidence.append(f"- Output is {ratio:.0f}x longer than task description (high ratio)")
            score = min(score, 3)
        elif ratio > 8:
            evidence.append(f"- Output is {ratio:.0f}x longer than task description")
            score = min(score, 4)
    # Redundancy signals
    redundancy_checks = [
        (r"(?i)(as\s+(I|we)\s+(mentioned|said|noted|discussed)\s+(earlier|above|before))",
         "Refers back to earlier statement (possible repetition)"),
        (r"(?i)(to\s+summarize|in\s+summary|in\s+conclusion|to\s+conclude)",
         "Has explicit summary (good if needed, flag if redundant)"),
        (r"(?i)(let\s+me\s+(explain|break\s+this\s+down|walk\s+you\s+through))",
         "Meta-commentary adds words without information"),
    ]
    redundant_count = 0
    for pattern, label in redundancy_checks:
        matches = re.findall(pattern, text)
        if len(matches) > 2:
            redundant_count += 1
            evidence.append(f"- '{label}' appears {len(matches)} times")
    if redundant_count >= 2:
        score = min(score, 3)
    elif redundant_count == 1:
        score = min(score, 4)
    if not evidence and score == 5:
        evidence.append("+ No redundancy detected. Information density appears good.")
    result = AxisScore(name="Conciseness", score=score, evidence=evidence)
    if score < 5:
        result.improvement = "Cut meta-commentary, remove repeated points, trim examples to one representative case"
    return result
 def evaluate(task: Optional[str], output: str) -> list[AxisScore]:
    """Run all 5 axis checks and return scored results."""
    return [
        check_accuracy(output),
        check_completeness(output, task),
        check_clarity(output),
        check_actionability(output),
        check_concision(output, task),
    ]
 def format_report(scores: list[AxisScore]) -> str:
    """Format scores into a readable evaluation report."""
    avg = sum(s.score for s in scores) / len(scores)
    lines = []
    lines.append("=" * 60)
    lines.append("AGENT SELF-EVALUATION REPORT")
    lines.append("=" * 60)
    lines.append("")
    for s in scores:
        bar = "█" * s.score + "░" * (5 - s.score)
        lines.append(f"  {s.name:<15} {bar} {s.score}/5")
        for e in s.evidence:
            lines.append(f"    {e}")
        if s.improvement:
            lines.append(f"    → {s.improvement}")
        lines.append("")
    lines.append(f"  {'OVERALL':<15} {avg:.1f}/5")
    lines.append("")
    # Top improvements
    improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4]
    if improvements:
        lines.append("TOP IMPROVEMENTS (axes scoring < 4):")
        for s, imp in sorted(improvements, key=lambda x: x[0].score):
            lines.append(f"  [{s.name}] {imp}")
    else:
        lines.append("No axes below 4. Strong output across all dimensions.")
    return "\n".join(lines)
 def main():
    parser = argparse.ArgumentParser(
        description="Evaluate agent output against the 5-axis rubric"
    )
    parser.add_argument("--task", help="Task description (file path or inline text)")
    parser.add_argument("--output", help="Agent output to evaluate (file path)")
    parser.add_argument("--interactive", action="store_true", help="Prompt for task and read output from stdin")
    args = parser.parse_args()
    task = None
    output = None
    if args.interactive:
        task = input("Task description: ").strip()
        print("Paste agent output (Ctrl+D to finish):")
        output = sys.stdin.read()
    elif args.task and args.output:
        # Read task
        try:
            with open(args.task) as f:
                task = f.read()
        except OSError:
            # Not a readable file (missing, or too long to be a path): treat as inline text
            task = args.task
        # Read output
        try:
            with open(args.output) as f:
                output = f.read()
        except FileNotFoundError:
            print(f"Error: output file '{args.output}' not found", file=sys.stderr)
            sys.exit(1)
    else:
        # Pipe mode: read output from stdin
        output = sys.stdin.read()
        if args.task:
            try:
                with open(args.task) as f:
                    task = f.read()
            except OSError:
                # Not a readable file (missing, or too long to be a path): treat as inline text
                task = args.task
    if not output:
        print("Error: no output to evaluate", file=sys.stderr)
        sys.exit(1)
    scores = evaluate(task, output)
    print(format_report(scores))
 if __name__ == "__main__":
    main()
--- a/skills/agent-self-evaluation/templates/evaluation-report.md
+++ b/skills/agent-self-evaluation/templates/evaluation-report.md
@ -0,0 +1,70 @@
 # Agent Self-Evaluation Report Template
 Copy this template and fill it in after completing a task. Keep it in your conversation so the user sees it inline.
 ```
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
  Accuracy         █████ 5/5    or    ███░░ 3/5
    + [Evidence: passing tests, verified claims]
    - [Gaps: unverified claims, hedging language]
  Completeness      █████ 5/5
    + [What's covered: all requirements + edge cases]
    - [What's missing: explicitly acknowledge gaps]
  Clarity           █████ 5/5
    + [Structure: headings, code blocks, bullet points]
    - [Issues: undefined terms, wall of text, no summary]
  Actionability     █████ 5/5
    + [User can: merge PR, run command, review file]
    - [Blockers: missing steps, vague suggestions]
  Conciseness       █████ 5/5
    + [Tight: no repetition, high information density]
    - [Bloat: filler, meta-commentary, repeated points]
  OVERALL           X.X/5
 TOP IMPROVEMENTS:
  [Only list axes scoring < 4, ranked by user impact]
 ```
 ## Quick Reference: Scoring Triggers
 | If you see this... | Accuracy | Completeness | Clarity | Actionability | Conciseness |
 |---|---|---|---|---|---|
 | "should work" / "probably fine" | ≤4 | — | — | — | — |
 | "I think" / "I believe" | ≤4 | — | — | — | — |
 | No test output cited | ≤4 | — | — | — | — |
 | "TODO" / "FIXME" left behind | ≤3 | ≤3 | — | ≤3 | — |
 | Missing error handling | — | ≤3 | — | — | — |
 | Only happy path covered | — | ≤3 | — | — | — |
 | Wall-of-text paragraph (>200 words) | — | — | ≤3 | — | — |
 | No headings or structure | — | — | ≤3 | — | — |
 | "You should..." without specifics | — | — | — | ≤3 | — |
 | No PR or file created | — | — | — | ≤3 | — |
 | User needs to figure out next step | — | — | — | ≤2 | — |
 | Repeated points (3+ times) | — | — | — | — | ≤3 |
 | "Let me explain..." / "To summarize..." x3+ | — | — | — | — | ≤3 |
 | Output >15x longer than task | — | — | — | — | ≤3 |
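The table above reads as a set of score caps: each trigger limits one or more axes to a maximum value. A minimal sketch of that logic follows, covering only a few rows. The trigger keys (`hedged_claim`, `todo_left_behind`, and so on) and the `apply_caps` helper are hypothetical names chosen for illustration; they are not part of `evaluate.py`.

```python
# Hypothetical encoding of a few rows from the Quick Reference table.
# Trigger keys and the apply_caps helper are illustrative assumptions.
AXES = ["Accuracy", "Completeness", "Clarity", "Actionability", "Conciseness"]

SCORE_CAPS: dict[str, dict[str, int]] = {
    "hedged_claim": {"Accuracy": 4},  # "should work" / "probably fine"
    "todo_left_behind": {"Accuracy": 3, "Completeness": 3, "Actionability": 3},
    "wall_of_text": {"Clarity": 3},  # paragraph over 200 words
    "no_next_step": {"Actionability": 2},  # user must figure out the next step
    "output_over_15x_task": {"Conciseness": 3},
}


def apply_caps(triggers: list[str]) -> dict[str, int]:
    """Start every axis at 5 and lower it to the tightest cap that applies."""
    scores = {axis: 5 for axis in AXES}
    for trigger in triggers:
        # Unknown triggers contribute no caps and are ignored
        for axis, cap in SCORE_CAPS.get(trigger, {}).items():
            scores[axis] = min(scores[axis], cap)
    return scores


print(apply_caps(["todo_left_behind", "no_next_step"]))
```

Taking the minimum means overlapping triggers can only lower a score, so the order in which triggers are applied does not matter.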
 ## When to Skip
 Skip the evaluation if:
 - Task was a single tool call (e.g., "read this file" — nothing to evaluate)
 - User explicitly says "don't evaluate" or "just do it"
 - Task is purely conversational (greeting, small talk)
 - You're mid-workflow and the user will judge the final output, not intermediate steps
 ## Post-Evaluation Actions
 | Overall Score | What to do |
 |---|---|
 | ≥4.5 | Deliver. No changes needed. |
 | 3.5–4.4 | Flag the top improvement but deliver. Fix it if that takes under 30 seconds. |
 | 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" |
 | <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. |
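The thresholds above amount to a lookup on the overall score. A minimal sketch, assuming the overall score is the mean of the five axis scores shown to one decimal place (as `format_report` prints it); the function name and the returned labels are hypothetical:

```python
def post_evaluation_action(overall: float) -> str:
    """Map an overall score (0-5) to the action from the table above."""
    score = round(overall, 1)  # the table's bands are stated to one decimal
    if score >= 4.5:
        return "deliver"
    if score >= 3.5:
        return "deliver, flag top improvement"
    if score >= 2.5:
        return "ask user: redo or deliver as-is"
    return "redo before delivering"


axis_scores = [5, 4, 4, 3, 5]  # example axis scores, one per axis
overall = sum(axis_scores) / len(axis_scores)
print(f"{overall:.1f} -> {post_evaluation_action(overall)}")
```

Checking the bands from highest to lowest keeps each branch to a single comparison and leaves no gap between adjacent ranges.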