From bd4594794158d7cad3619fd5a3e15fd3ce5502cc Mon Sep 17 00:00:00 2001
From: Hawthorn <rv.help23@gmail.com>
Date: Wed, 10 Jun 2026 16:56:18 +0530
Subject: [PATCH 01/10] feat(skills,agents): add agent-self-evaluation skill
 and agent-evaluator persona

Add structured 5-axis self-evaluation framework for agent output quality:
- Accuracy, Completeness, Clarity, Actionability, Conciseness
- Evidence-based scoring with concrete improvement suggestions
- Standalone Python evaluator script with keyword heuristics
- Detailed scoring anchors reference guide
- High-score and low-score annotated examples
- Reusable evaluation report template
- Optional hook integration for session-stop evaluation

Agent persona (agent-evaluator) provides a dedicated subagent
for applying the rubric to agent output with tool-backed verification.

All files tested: Python script runs, examples score correctly
(high 4.2, low 3.4), frontmatter parses clean, SKILL.md is 182 lines (under 500).
---
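Note: the OVERALL figures in the example reports can be checked by hand.
The sketch below assumes the overall score is the simple average of the five
axis scores rounded to one decimal (as SKILL.md Step 3 states) and that an
axis is critical at a score of 2 or below (as the agent-evaluator report
format states). `overall_score` and `critical_axes` are hypothetical helper
names for illustration only; they are not taken from evaluate.py.

```python
AXES = ("accuracy", "completeness", "clarity", "actionability", "conciseness")


def overall_score(scores: dict[str, int]) -> float:
    """Simple average of the five axis scores, rounded to one decimal."""
    for axis in AXES:
        if axis not in scores:
            raise ValueError(f"missing axis: {axis}")
        if not 1 <= scores[axis] <= 5:
            raise ValueError(f"{axis} score must be 1-5, got {scores[axis]}")
    return round(sum(scores[axis] for axis in AXES) / len(AXES), 1)


def critical_axes(scores: dict[str, int]) -> list[str]:
    """Axes scoring 2 or below, in rubric order (assumed threshold)."""
    return [axis for axis in AXES if scores[axis] <= 2]


# Scores copied from the strong and weak agent-evaluator examples.
strong = dict(zip(AXES, (5, 4, 5, 5, 4)))
weak = dict(zip(AXES, (2, 3, 3, 2, 3)))

print(overall_score(strong), critical_axes(strong))
print(overall_score(weak), critical_axes(weak))
```

The same helper applied to the low-score example file (2, 3, 4, 2, 3) gives
the 2.8 reported there.
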
 agents/agent-evaluator.md                     | 152 +++++++
 skills/agent-self-evaluation/SKILL.md         | 182 +++++++++
 .../examples/high-score-example.md            |  87 ++++
 .../examples/low-score-example.md             |  86 ++++
 .../references/evaluation-criteria.md         |  71 ++++
 .../references/hook-integration.md            |  59 +++
 .../agent-self-evaluation/scripts/evaluate.py | 371 ++++++++++++++++++
 .../templates/evaluation-report.md            |  70 ++++
 8 files changed, 1078 insertions(+)
 create mode 100644 agents/agent-evaluator.md
 create mode 100644 skills/agent-self-evaluation/SKILL.md
 create mode 100644 skills/agent-self-evaluation/examples/high-score-example.md
 create mode 100644 skills/agent-self-evaluation/examples/low-score-example.md
 create mode 100644 skills/agent-self-evaluation/references/evaluation-criteria.md
 create mode 100644 skills/agent-self-evaluation/references/hook-integration.md
 create mode 100755 skills/agent-self-evaluation/scripts/evaluate.py
 create mode 100644 skills/agent-self-evaluation/templates/evaluation-report.md

diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md
new file mode 100644
index 00000000..dbb7a904
--- /dev/null
+++ b/agents/agent-evaluator.md
@@ -0,0 +1,152 @@
+---
+name: agent-evaluator
+description: Evaluates agent output against a 5-axis quality rubric (accuracy, completeness, clarity, actionability, conciseness). Use after any non-trivial task when the user wants a quality assessment, or when the agent-self-evaluation skill is active. Produces a structured scorecard with evidence and improvement suggestions.
+tools: ["Read", "Grep", "Glob", "Bash"]
+model: sonnet
+---
+
+You are a quality evaluator for AI agent output. Your job is to assess agent responses against structured criteria, not to perform the original task.
+
+## Your Role
+
+- Score agent output on 5 axes: Accuracy, Completeness, Clarity, Actionability, Conciseness
+- Every score below 5 MUST cite specific evidence from the output
+- Provide concrete, actionable improvement suggestions
+- Maintain objectivity — evaluate the output, not the agent's effort or intent
+- Load the `agent-self-evaluation` skill for the detailed scoring rubric
+
+- DO NOT re-perform the original task
+- DO NOT suggest alternative approaches unless the current approach is factually wrong
+- DO NOT assign score 5 without citing evidence of correctness
+- DO NOT penalize for missing features the user didn't request
+
+## Workflow
+
+### Step 1: Understand the Task
+
+Read the user's original request and the agent's final output. Identify:
+- What was explicitly asked for
+- What was implicitly expected (standard practices, edge cases)
+- What the agent claimed to deliver
+
+### Step 2: Gather Evidence
+
+Use tools to verify claims:
+- Run `grep` to confirm API names, function signatures, file paths
+- Check test output for pass/fail status
+- Verify that files the agent claims to have created actually exist
+- Cross-reference claims against project conventions (check existing files for patterns)
+
+### Step 3: Score Each Axis
+
+Work through the 5 axes from the `agent-self-evaluation` skill:
+
+1. **Accuracy** — Are claims correct? Grep the codebase to verify.
+2. **Completeness** — All requirements covered? List what's there and what's missing.
+3. **Clarity** — Well-structured? Check for headings, code blocks, summaries.
+4. **Actionability** — Can the user act immediately? Is there a PR, a command, a file?
+5. **Conciseness** — No fluff? Check for redundancy, filler, meta-commentary.
+
+For each axis:
+- Assign score 1-5
+- If score < 5, cite the specific gap with evidence (line numbers, grep output, file existence)
+- Write a one-sentence improvement
+
+### Step 4: Produce Report
+
+Use this format:
+
+```
+============================================================
+AGENT EVALUATION REPORT
+============================================================
+
+  Axis            Score   Evidence
+
+  Accuracy         X/5    [What was verified, what was wrong]
+  Completeness     X/5    [What's covered, what's missing]
+  Clarity          X/5    [Structure quality, readability]
+  Actionability    X/5    [Can user act now? What's the next step?]
+  Conciseness      X/5    [Information density, redundancy]
+
+  OVERALL          X.X/5
+
+CRITICAL ISSUES (axes ≤ 2):
+  [If any axis scored 2 or below, list it here with the specific fix needed]
+
+TOP IMPROVEMENTS:
+  1. [Highest impact fix first]
+  2. [Second highest]
+  3. [Third highest]
+
+VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
+```
+
+## Output Format
+
+Always include the structured report above. Its final line is the one-line verdict: "Deliver as-is", "Fix [specific issue] then deliver", or "Redo with [specific approach]".
+
+## Examples
+
+### Example: Strong Output
+
+Task: Add retry logic to HTTP client. 3 retries, exponential backoff.
+
+```
+AGENT EVALUATION REPORT
+
+  Accuracy         5/5    grep confirms httpx.Retry used correctly.
+                          Tests pass (42/42). Import verified.
+  Completeness     4/5    All HTTP methods covered. Missing: connection
+                          pool exhaustion handling (minor edge case).
+  Clarity          5/5    Well-structured. Summary, code blocks, bullet
+                          points. 10-second scan tells the full story.
+  Actionability    5/5    Single PR (#423). `pytest -v` cited. Merge is
+                          the only action needed.
+  Conciseness      4/5    250 words. Verification section slightly
+                          verbose: 3 commands could be 1 script.
+
+  OVERALL          4.6/5
+
+TOP IMPROVEMENTS:
+  1. Add connection pool exhaustion to edge cases doc
+  2. Consolidate verification commands into a single script
+
+VERDICT: Deliver as-is. The one gap (pool exhaustion) is a P2 edge case.
+```
+
+### Example: Weak Output
+
+Task: Same as above.
+
+```
+AGENT EVALUATION REPORT
+
+  Accuracy         2/5    CRITICAL: Agent used urllib3.Retry but project
+                          uses httpx. grep proves no urllib3 import exists.
+                          Hedging language: "I think", "probably fine".
+  Completeness     3/5    Only handles 5xx. Missing: 429 rate limiting,
+                          connection timeouts. Agent acknowledges gaps
+                          ("might be edge cases") but doesn't fix them.
+  Clarity          3/5    Code is readable but no explanation of where
+                          to integrate. "Add this somewhere" is vague.
+  Actionability    2/5    No PR, no file created, no test written.
+                          User has to: figure out placement, fix library,
+                          write tests, handle idempotency.
+  Conciseness      3/5    120 words but ~50% is hedging/disclaimers.
+                          Low information density.
+
+  OVERALL          2.6/5
+
+CRITICAL ISSUES:
+  Accuracy: Wrong library. Use httpx.Retry, not urllib3.Retry.
+  Actionability: No deliverable. Create a PR with the changed file + tests.
+
+TOP IMPROVEMENTS:
+  1. Switch to httpx.Retry — grep the codebase first to confirm the HTTP library
+  2. Create a PR with src/api_client.py + tests/test_api_client.py
+  3. Handle 429, connection errors, and timeout — not just 5xx
+
+VERDICT: Redo with httpx.Retry, full HTTP method coverage, and a test file.
+  Do not deliver until accuracy ≥ 4.
+```
diff --git a/skills/agent-self-evaluation/SKILL.md b/skills/agent-self-evaluation/SKILL.md
new file mode 100644
index 00000000..0aa3c986
--- /dev/null
+++ b/skills/agent-self-evaluation/SKILL.md
@@ -0,0 +1,182 @@
+---
+name: agent-self-evaluation
+description: Use after completing any non-trivial task. The agent self-rates its output on 5 axes — accuracy, completeness, clarity, actionability, conciseness — with concrete evidence per criterion. Produces a structured 1-5 scorecard with specific improvement suggestions.
+origin: ECC
+---
+
+# Agent Self-Evaluation
+
+After completing a complex task, the agent pauses to rate its own output against a structured 5-axis rubric. This is NOT a pass/fail gate; it's a deliberate reflection step that catches omissions, flags overconfidence, and surfaces areas for improvement before the user has to.
+
+## When to Activate
+
+- After writing code that spans 3+ files or 50+ lines
+- After completing a multi-step workflow (implement → test → review)
+- After a debugging session that involved 3+ attempts
+- After producing a design document, architecture decision, or written analysis
+- When the user asks "how good was that?" or "rate yourself"
+- At the end of a session, via a Stop hook (if configured; see `references/hook-integration.md`)
+
+## Core Concepts
+
+### The 5 Evaluation Axes
+
+| Axis | Question | What it catches |
+|---|---|---|
+| **Accuracy** | Are the facts, claims, and outputs correct? | Hallucinations, wrong API names, incorrect syntax, false statements |
+| **Completeness** | Did it cover everything the user asked for? | Missed edge cases, unhandled error paths, forgotten requirements, skipped subtasks |
+| **Clarity** | Is the explanation understandable and well-structured? | Confusing explanations, jargon without definition, missing context, rambling |
+| **Actionability** | Can the user act on the output immediately? | Vague suggestions, missing steps, "you should X" without showing how, no verification path |
+| **Conciseness** | Did it use the minimum words/tokens needed? | Redundancy, over-explanation, repeating the user's question verbatim, filler content |
+
+### Scoring Scale
+
+```
+5 — Exceptional: no reasonable improvement possible
+4 — Good: minor nits only, no substantive gaps
+3 — Adequate: meets the request but has a notable weakness on at least one axis
+2 — Weak: has a clear gap that affects usability or correctness
+1 — Poor: fundamentally misses the request or contains significant errors
+```
+
+### The Evidence Rule
+
+Every score below 5 MUST cite specific evidence. A score of 3 cannot just say "could be better" — it must say exactly what is missing or wrong. The mantra: **"Show the gap, don't just name it."**
+
+## Workflow
+
+### Step 1: Collect the Raw Material
+
+Gather what you'll evaluate:
+
+```
+- The original user request (read back from conversation)
+- Your final response/output (the deliverable)
+- Any tool outputs that verify correctness (test results, exit codes, lint output)
+- Any user feedback received during the task (corrections, "try again", "that's not right")
+```
+
+### Step 2: Score Each Axis Independently
+
+Work through the 5 axes one at a time. For each:
+
+1. Read the axis question
+2. Find evidence (or lack of evidence) in the output
+3. Assign a score 1-5
+4. If score < 5, write a one-sentence improvement note citing the gap
+
+Do NOT average the scores in your head first and then work backwards. Score each axis fresh.
+
+### Step 3: Produce the Evaluation Report
+
+Use the template from `templates/evaluation-report.md`. The report must include:
+
+```
+- One-line summary
+- 5-axis scorecard (score + evidence per axis)
+- Overall score (simple average, rounded to 1 decimal)
+- 1-3 specific improvements ranked by impact
+- Self-check: "Would the user agree with this assessment?"
+```
+
+### Step 4: Apply the Improvement
+
+If any axis scored 3 or below:
+
+1. State what you would do differently
+2. If the gap is fixable in < 30 seconds (missing link, unclear phrasing), fix it now
+3. If the gap requires rework, flag it explicitly: "This axis scored __ because __. Re-running with __ would likely raise it to __."
+
+## Code Examples
+
+### Example: Good Evaluation (Score 4+)
+
+```
+Task: Add retry logic to HTTP client
+
+Scorecard:
+  Accuracy:    5 — All API calls correct. Verified: retries use
+                  exponential backoff. No hallucinated methods.
+  Completeness: 4 — Covered happy path + 3 error cases. Missing:
+                  timeout handling for hung connections.
+  Clarity:      5 — Code comments explain backoff formula.
+                  PR description links to incident that motivated this.
+  Actionability:5 — Single merge. No follow-up tasks. Tests pass.
+  Conciseness:  4 — 47 lines total. The retry loop could be extracted
+                  into a helper to drop ~8 lines.
+
+Overall: 4.6 — One gap (timeout handling). Fix before merging.
+```
+
+### Example: Weak Evaluation (Score 2-3)
+
+```
+Task: Add retry logic to HTTP client
+
+Scorecard:
+  Accuracy:    2 — Used urllib3.Retry which doesn't exist in our
+                  httpx-based codebase. Wrong library.
+  Completeness: 3 — Works for GET. POST/PUT not handled (user
+                  said "all HTTP requests").
+  Clarity:      4 — Code is readable. Good variable names.
+  Actionability:2 — "Add tests" mentioned but no test file created.
+                  User has to write tests before merging.
+  Conciseness:  3 — 120 lines. The retry config is duplicated in
+                  3 places instead of one shared RetryConfig object.
+
+Overall: 2.8 — Wrong library used. Needs httpx rewrite.
+  Fix accuracy first (switch to httpx.Retry), then extend to all
+  HTTP methods, then consolidate config.
+```
+
+## Anti-Patterns
+
+### "Everything is a 5"
+
+```
+❌ Accuracy:    5 — All good.
+   Completeness: 5 — Everything covered.
+   Clarity:      5 — Clear.
+```
+
+No evidence cited. This is self-congratulation, not evaluation. A real 5 requires proving there's nothing to improve.
+
+### Over-penalizing for scope creep
+
+```
+❌ Completeness: 2 — Didn't handle WebSocket connections or
+   gRPC streaming (user didn't ask for these)
+```
+
+Only evaluate against what the user actually requested, not what you could have additionally built.
+
+### Using the evaluation to re-litigate
+
+```
+❌ "As I said earlier, this approach is wrong. Score: 1"
+```
+
+The evaluation is about the delivered output, not about re-arguing design decisions that were already made. If the approach was wrong, that should have been caught before delivery.
+
+### Mixing personal preference with objective gaps
+
+```
+❌ "Score: 3. I don't like Python decorators."
+```
+
+"Don't like" is not evidence. Cite a concrete readability, testability, or correctness concern, or leave the score at 4+.
+
+## Best Practices
+
+- **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took.
+- **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap.
+- **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling."
+- **Be specific about what 'fixed' looks like.** "Re-run with httpx.Retry( total=3, backoff_factor=1.0 )" beats "fix the library issue."
+- **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof.
+- **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?"
+
+## Related Skills
+
+- `agent-eval` — Head-to-head comparison of different coding agents on benchmark tasks
+- `verification-loop` — Systematic verification of outputs against expected results
+- `security-review` — Security-focused code review checklist
diff --git a/skills/agent-self-evaluation/examples/high-score-example.md b/skills/agent-self-evaluation/examples/high-score-example.md
new file mode 100644
index 00000000..a2da55b8
--- /dev/null
+++ b/skills/agent-self-evaluation/examples/high-score-example.md
@@ -0,0 +1,87 @@
+# High-Score Example: Adding Retry Logic
+
+## Task
+> Add retry logic with exponential backoff to all HTTP requests in the API client. Max 3 retries, initial delay 1s, max delay 30s.
+
+## Agent Output
+
+### Summary
+Added `httpx.Retry` transport to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
+
+### Files Changed
+- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.Retry( total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504] )`
+- `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429)
+
+### Verification
+```
+$ pytest tests/test_api_client.py -v
+42 passed in 2.34s
+
+$ grep -r "RetryTransport" src/
+src/api_client.py:23: class RetryTransport(httpx.BaseTransport):
+src/api_client.py:45:     self._transport = RetryTransport(...)
+
+$ python -c "from src.api_client import ApiClient; c = ApiClient(); print(type(c._transport))"
+<class 'src.api_client.RetryTransport'>
+```
+
+### Edge Cases Handled
+- Connection timeouts (retry with backoff)
+- Rate limiting (429 → retry after delay)
+- Server errors (5xx → retry)
+- Client errors except 429 (no retry — 4xx means the request is bad)
+- Idempotency: GET, PUT, DELETE retry; POST does not (configurable)
+
+### Merge: [PR #423](https://github.com/example/pr/423)
+
+---
+
+## Evaluation Report
+
+```
+============================================================
+AGENT SELF-EVALUATION REPORT
+============================================================
+
+  Accuracy         █████ 5/5
+    + Tests passing
+    + Explicit verification (grep confirmed class exists, import test passed)
+    + Lint clean
+    → All claims backed by tool output. No hedging.
+
+  Completeness      ████░ 4/5
+    + Edge cases addressed (5 specific scenarios listed)
+    + Error handling present across all HTTP methods
+    → Missing: connection pool exhaustion handling (what happens when all
+      connections are in retry state?) — minor gap, not blocking.
+
+  Clarity           █████ 5/5
+    + Uses headings for structure
+    + Uses code blocks
+    + Uses bullet points
+    + Summary in first 3 lines
+    → Well-organized. Reader can scan in 10 seconds.
+
+  Actionability     █████ 5/5
+    + PR created and linked
+    + Specific run command given (pytest)
+    + Verification steps included
+    → Single action: merge PR #423. Everything else is done.
+
+  Conciseness       ████░ 4/5
+    + No redundancy detected
+    → The verification section could be slightly tighter (3 commands
+      could be 1 with a verification script). Minor.
+
+  OVERALL           4.6/5
+
+TOP IMPROVEMENTS:
+  No axes below 4. Strong output across all dimensions.
+```
+
+### Why This Scores Well
+
+1. **Accuracy pinned to tool output.** Every claim ("tests pass", "class exists", "import works") has a corresponding terminal output line. No "should work" or "probably fine."
+2. **Completeness is explicit about what's covered AND what's not.** The edge cases section lists both handled and intentionally-unhandled cases (POST idempotency).
+3. **Actionability is single-step.** The user only needs to merge one PR. No follow-up tasks, no "then configure X."
+4. **Conciseness is tight.** The output is ~250 words. The information density is high: every sentence carries weight.
diff --git a/skills/agent-self-evaluation/examples/low-score-example.md b/skills/agent-self-evaluation/examples/low-score-example.md
new file mode 100644
index 00000000..25047e7a
--- /dev/null
+++ b/skills/agent-self-evaluation/examples/low-score-example.md
@@ -0,0 +1,86 @@
+# Low-Score Example: Adding Retry Logic
+
+## Task
+> Add retry logic with exponential backoff to all HTTP requests in the API client. Max 3 retries, initial delay 1s, max delay 30s.
+
+## Agent Output
+
+Here's a retry implementation. I think this should work for most cases.
+
+I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically so we just need to configure it.
+
+```python
+from urllib3.util import Retry
+
+# Add this somewhere in your HTTP client
+retry = Retry(
+    total=3,
+    backoff_factor=1.0,
+    status_forcelist=[500, 502, 503, 504]
+)
+```
+
+You'll want to test this before deploying. There might be edge cases with POST requests but those are probably fine. I haven't tested the timeout behavior but it should work.
+
+Let me know if you need anything else!
+
+---
+
+## Evaluation Report
+
+```
+============================================================
+AGENT SELF-EVALUATION REPORT
+============================================================
+
+  Accuracy         ██░░░ 2/5
+    - Hedged claim without verification ("I think this should work")
+    - Untested ("I haven't tested the timeout behavior")
+    - Speculation without evidence ("those are probably fine")
+    → Wrong library used. Project uses httpx, not urllib3.
+      urllib3.util.Retry is incompatible with httpx transport.
+
+  Completeness      ███░░ 3/5
+    - Explicit gap acknowledged ("might be edge cases with POST")
+    - Limited scope noted (only mentioned 5xx, not 429 or connection errors)
+    → User asked for "all HTTP requests." Only partial coverage:
+      missing 429 handling, connection errors, timeout handling.
+
+  Clarity           ████░ 4/5
+    + Uses code blocks
+    → Code is readable but no explanation of where to add it
+      ("somewhere in your HTTP client" is vague).
+
+  Actionability     ██░░░ 2/5
+    - Defers work to user ("you'll want to test this")
+    - Vague suggestion without specifics
+    → No PR, no file created, no test written. User has to:
+      1. Figure out where to add the code
+      2. Fix the library mismatch (httpx not urllib3)
+      3. Write tests
+      4. Handle POST idempotency
+      5. Test timeout behavior
+
+  Conciseness       ███░░ 3/5
+    - Meta-commentary adds words without information
+      ("Let me know if you need anything else!")
+    → 120 words. Low word count but low information density.
+      Half the text is hedging and disclaimers, not substance.
+
+  OVERALL           2.8/5
+
+TOP IMPROVEMENTS (axes scoring < 4):
+  [Accuracy] Switch to httpx.Retry — grep the codebase to confirm the HTTP
+    library before writing code.
+  [Actionability] Create a PR with the changed file + test file. Run the
+    tests. End with "PR #N ready to merge."
+  [Completeness] List what's covered AND what's not. If POST retry is
+    unsafe, say so explicitly with reasoning.
+```
+
+### Why This Scores Poorly
+
+1. **Accuracy fails at the most basic level** — wrong library. One `grep httpx src/` would have caught this. The hedging language ("I think", "probably", "should work") signals the agent knows it's guessing.
+2. **Not actionable.** The user received a code snippet and a list of things they need to do. The agent did the easy part (suggesting a library) and deferred the hard parts (testing, integration, edge cases) to the user.
+3. **Completeness gaps are acknowledged but not fixed.** "Might be edge cases" is worse than not mentioning them — it shows awareness of the gap and a choice not to address it.
+4. **Information density is low.** 120 words, of which ~60 are hedging/disclaimers/politeness. The actual substance (3 lines of code) could have been delivered in 40 words with verification.
diff --git a/skills/agent-self-evaluation/references/evaluation-criteria.md b/skills/agent-self-evaluation/references/evaluation-criteria.md
new file mode 100644
index 00000000..faf83e7d
--- /dev/null
+++ b/skills/agent-self-evaluation/references/evaluation-criteria.md
@@ -0,0 +1,71 @@
+# Evaluation Criteria — Detailed Scoring Guide
+
+This reference provides concrete scoring anchors for each axis. Use it when you're unsure whether a gap merits a 4 vs a 3, or a 2 vs a 1.
+
+## Accuracy
+
+| Score | Anchor | Example |
+|---|---|---|
+| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Used `httpx.HTTPTransport(retries=3)`, confirmed in httpx docs. All method names verified with grep against codebase. |
+| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (httpx's default timeout is 5s, claimed 10s). |
+| 3 | One significant factual error, or 3+ minor inaccuracies. | Used `urllib3.Retry` in an httpx codebase. Works in this one case but wrong library. |
+| 2 | Multiple significant errors. Output would fail if followed. | Claimed "add this to package.json" but project uses pyproject.toml. Two other config claims also wrong. |
+| 1 | Fundamentally incorrect. Output contradicts itself or known facts. | Code has syntax errors. API endpoint doesn't exist. Claims a function signature that grep disproves. |
+
+## Completeness
+
+| Score | Anchor | Example |
+|---|---|---|
+| 5 | All explicit and implicit requirements covered. Edge cases handled. Error paths addressed. | User said "add retry to all HTTP requests." GET, POST, PUT, DELETE all covered. Timeout, 429, 5xx all handled. |
+| 4 | All explicit requirements covered. One implicit requirement missed. | All HTTP methods covered. Forgot to handle connection timeouts (not mentioned but expected). |
+| 3 | One explicit requirement missed, or 2+ implicit gaps. | User said "add logging too." Retry logic added but no logging. |
+| 2 | Multiple explicit requirements missed. Output is a partial solution. | Asked for retry + circuit breaker. Only retry implemented. |
+| 1 | Misses the core request. Delivers something adjacent to what was asked. | Asked for retry logic. Wrote a health check endpoint instead. |
+
+## Clarity
+
+| Score | Anchor | Example |
+|---|---|---|
+| 5 | Perfectly structured. Jargon explained or avoided. Visual hierarchy helps scanning. No ambiguity. | README with clear sections, code blocks, and a 10-second summary at top. |
+| 4 | Generally clear. One section could be better organized or one term undefined. | Good structure but `exponential backoff` used without explanation — assumes the reader knows it. |
+| 3 | Understandable after re-reading. Multiple organizational issues or undefined terms. | The explanation circles the point before getting to it. Several terms used before defined. |
+| 2 | Confusing in places. Reader would need to ask follow-up questions. | Code works but the PR description doesn't explain why retry was needed or what it fixes. |
+| 1 | Unintelligible or contradictory. Reader cannot determine what was done or why. | Output is a wall of text with no structure. Conclusions contradict earlier statements. |
+
+## Actionability
+
+| Score | Anchor | Example |
+|---|---|---|
+| 5 | Single action required. Verification path included. No implicit steps. | "Merge this PR. Tests pass: `42 passed`. Deploy with `./deploy.sh`." |
+| 4 | Single action required but verification path is implied, not explicit. | "Merge this PR." (Tests exist but weren't cited. User has to check themselves.) |
+| 3 | Multiple actions required, or one action with unclear next step. | "Review and merge. Then update the config." (Which config? Where? No link or path.) |
+| 2 | User must figure out how to use the output. Missing critical instructions. | Code written but no test file, no run instructions, no PR created. User has to assemble everything. |
+| 1 | Output cannot be acted on without significant rework or clarification. | "Here's a design idea." (No code, no file, no PR. User has to start from scratch.) |
+
+## Conciseness
+
+| Score | Anchor | Example |
+|---|---|---|
+| 5 | Every sentence earns its place. No redundancy. Information density is high. | 30 lines that say what 60 lines would. No repeated points. No filler. |
+| 4 | Minor redundancy. One paragraph could be tightened. | Good overall but repeats the motivation in both the PR description and code comments. |
+| 3 | Noticeable redundancy. 20%+ of content could be removed without loss. | Explains the same concept three times (in summary, body, and conclusion). Verbose examples. |
+| 2 | Significantly bloated. 40%+ of content is filler or repetition. | 200 lines for a task that needed 60. Restates the user's question. Includes irrelevant background. |
+| 1 | Signal-to-noise ratio is inverted. More filler than substance. | 500-line response to a 2-line question. Most of it is boilerplate, repetition, or irrelevant context. |
+
+## Edge Cases
+
+### When the user gave unclear instructions
+
+If the user's request was ambiguous, do NOT penalize completeness for not reading minds. Instead, note in the evaluation: "User's request was ambiguous about __. I chose interpretation __. If they meant __, this score would drop to __."
+
+### When the task is inherently simple
+
+A 3-line bug fix can legitimately score 5/5/5/5/5. The rubric scales with complexity — a simple task done perfectly IS a 5.0. Don't invent gaps to justify lower scores.
+
+### When you caught your own error mid-task
+
+If you made an error, caught it, and fixed it before delivering — that's a 5 on Accuracy for the final output. The evaluation is about what the user received, not your internal process. Note the self-correction as evidence of thoroughness, not as a penalty.
+
+### When the tool output contradicts your claim
+
+If you claimed "tests pass" but the terminal output shows a failure — that's an automatic Accuracy ≤ 2. Tool output is ground truth. Claims without verification are the most common source of low accuracy scores.
diff --git a/skills/agent-self-evaluation/references/hook-integration.md b/skills/agent-self-evaluation/references/hook-integration.md
new file mode 100644
index 00000000..78246b37
--- /dev/null
+++ b/skills/agent-self-evaluation/references/hook-integration.md
@@ -0,0 +1,59 @@
+# Hook Integration for Session-Stop Self-Evaluation
+
+Add this hook to `hooks/hooks.json` to print a self-evaluation reminder at the end of every session:
+
+```json
+{
+  "hooks": {
+    "Stop": [
+      {
+        "matcher": "true",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "echo '[Self-Eval] Session complete. Consider running agent-self-evaluation to rate your output.'"
+          }
+        ],
+        "description": "Remind agent to self-evaluate at session end"
+      }
+    ]
+  }
+}
+```
+
+## Integration with the Python Evaluator
+
+The `scripts/evaluate.py` script can be used as a standalone tool:
+
+```bash
+# Pipe agent output directly
+echo "Your agent response here" | python3 skills/agent-self-evaluation/scripts/evaluate.py
+
+# From files
+python3 skills/agent-self-evaluation/scripts/evaluate.py --task task.txt --output response.txt
+```
+
+To run the evaluator from a hook, the last agent output must first be captured to a file and passed via `--output`. The simpler hook below only prints a reminder after test runs:
+
+```json
+{
+  "PostToolUse": [
+    {
+      "matcher": "tool == \"Bash\" && tool_input.command matches \"(test|pytest|npm test|go test)\"",
+      "hooks": [
+        {
+          "type": "command",
+          "command": "echo '[Self-Eval] Tests completed. Consider running agent-self-evaluation.'"
+        }
+      ],
+      "description": "Remind agent to self-evaluate after test runs"
+    }
+  ]
+}
+```
+
+These hooks are opt-in. Add them to your local `hooks/hooks.json` if you want automated evaluation prompts.
+
+## Manual Usage (Recommended)
+
+The most reliable approach is manual invocation — the agent runs self-evaluation as part of its workflow when the `agent-self-evaluation` skill is active, without requiring hook configuration. The skill's "When to Activate" section already covers trigger conditions (multi-file changes, debugging sessions, design documents).
diff --git a/skills/agent-self-evaluation/scripts/evaluate.py b/skills/agent-self-evaluation/scripts/evaluate.py
new file mode 100755
index 00000000..354f5a7a
--- /dev/null
+++ b/skills/agent-self-evaluation/scripts/evaluate.py
@@ -0,0 +1,371 @@
+#!/usr/bin/env python3
+"""Standalone agent output evaluator using the 5-axis rubric.
+
+Reads a task description and agent output from stdin or files,
+scores each axis, and prints a structured evaluation report.
+
+Usage:
+    # Pipe output directly
+    cat response.txt | evaluate.py --task "Add retry logic"
+
+    # From files
+    evaluate.py --task task.txt --output response.txt
+
+    # Interactive (reads task from prompt, output from stdin)
+    evaluate.py --interactive
+
+The evaluator uses keyword heuristics + structural checks as a first pass.
+For production use, pair with an LLM judge for semantic understanding.
+"""
+
+import argparse
+import re
+import sys
+from dataclasses import dataclass, field
+from typing import Optional
+
+
+@dataclass
+class AxisScore:
+    name: str
+    score: int
+    evidence: list[str] = field(default_factory=list)
+    improvement: Optional[str] = None
+
+
+def count_words(text: str) -> int:
+    return len(text.split())
+
+
+def check_accuracy(text: str) -> AxisScore:
+    """Check for verifiable claims, tool output references, error signs."""
+    evidence = []
+    deductions = 0
+    score = 5
+
+    # Positive signals: verified claims
+    verified_patterns = [
+        (r"(?i)(tests?\s+pass|all\s+tests?\s+passing|\d+\s+passed)", "Tests passing"),
+        (r"(?i)(exit\s+code\s*[:=]?\s*0|exited\s+with\s+0)", "Clean exit code"),
+        (r"(?i)(lint.*clean|no\s+lint\s+errors|0\s+errors)", "Lint clean"),
+        (r"(?i)(verified|confirmed|validated)\s+(with|against|using|by)", "Explicit verification"),
+        (r"(?i)(grep|rg)\s+.*\b(found|matched|returned)", "Grep confirmed"),
+    ]
+    for pattern, label in verified_patterns:
+        if re.search(pattern, text):
+            evidence.append(f"+ {label}")
+
+    # Negative signals: unverified claims
+    danger_patterns = [
+        (r"(?i)(should\s+work|probably\s+fine|should\s+be\s+ok)", "Hedged claim without verification"),
+        (r"(?i)(I\s+think|I\s+believe|I\s+assume|might\s+be)", "Speculation without evidence"),
+        (r"(?i)(untested|not\s+tested|haven'?t\s+tested)", "Explicitly untested"),
+        (r"(?i)(TODO|FIXME|HACK|WORKAROUND)", "Unresolved TODO/FIXME"),
+    ]
+    for pattern, label in danger_patterns:
+        if re.search(pattern, text):
+            deductions += 1
+            evidence.append(f"- {label}")
+
+    if deductions >= 3:
+        score = 2
+    elif deductions == 2:
+        score = 3
+    elif deductions == 1:
+        score = 4
+
+    if not evidence:
+        evidence.append("No verification signals detected — score assumes correctness")
+
+    result = AxisScore(name="Accuracy", score=score, evidence=evidence)
+    if score < 5:
+        result.improvement = "Cite specific tool outputs (test results, exit codes, grep findings) to back claims"
+    return result
+
+
+def check_completeness(text: str, task: Optional[str] = None) -> AxisScore:
+    """Check for requirement coverage, edge cases, error handling."""
+    evidence = []
+    score = 5
+
+    # Positive signals
+    completeness_signals = [
+        (r"(?i)(edge\s*cases?|corner\s*cases?)", "Edge cases addressed"),
+        (r"(?i)(error\s*handling|exception\s*handling|try/except|try\s*{)", "Error handling present"),
+        (r"(?i)(all\s+\w+\s+(methods|endpoints|routes))", "Full coverage claimed"),
+        (r"(?i)(verification|verified\s+that|confirmed\s+that)", "Verification step present"),
+    ]
+    for pattern, label in completeness_signals:
+        if re.search(pattern, text):
+            evidence.append(f"+ {label}")
+
+    # Gaps
+    gap_signals = [
+        (r"(?i)(not\s+covered|not\s+handled|out\s+of\s+scope)", "Explicit gap acknowledged"),
+        (r"(?i)(only\s+(works|handles|supports)\s+\w+)", "Limited scope noted"),
+        (r"(?i)(assume[sd]?\s+that|assuming\s+the)", "Assumption without verification"),
+    ]
+    deductions = 0
+    for pattern, label in gap_signals:
+        if re.search(pattern, text):
+            deductions += 1
+            evidence.append(f"- {label}")
+
+    if deductions >= 2:
+        score = 3
+    elif deductions == 1:
+        score = 4
+
+    if not evidence:
+        evidence.append("No completeness signals — unable to assess coverage")
+
+    result = AxisScore(name="Completeness", score=score, evidence=evidence)
+    if score < 5:
+        result.improvement = "List what was covered AND what was intentionally excluded, with reasoning"
+    return result
+
+
+def check_clarity(text: str) -> AxisScore:
+    """Check for structure, readability, jargon handling."""
+    evidence = []
+    score = 5
+    deductions = 0
+
+    # Positive signals
+    if re.search(r"^#{1,3}\s+", text, re.MULTILINE):
+        evidence.append("+ Uses headings for structure")
+    if re.search(r"```", text):
+        evidence.append("+ Uses code blocks")
+    if re.search(r"^\s*[-*]\s+", text, re.MULTILINE):
+        evidence.append("+ Uses bullet points")
+
+    # Negative signals
+    # Wall of text: long paragraph without breaks
+    paragraphs = [p for p in text.split("\n\n") if p.strip()]
+    for p in paragraphs:
+        if count_words(p) > 200:
+            deductions += 1
+            evidence.append("- Wall-of-text paragraph (>200 words without break)")
+            break
+
+    # Jargon without definition
+    jargon = [
+        (r"\b(idempotent|race condition|deadlock|thundering herd)\b", "concurrency"),
+        (r"\b(exponential backoff|circuit breaker|bulkhead)\b", "resilience"),
+        (r"\b(ACID|CAP|eventual consistency|linearizability)\b", "database theory"),
+    ]
+    for pattern, domain in jargon:
+        if re.search(pattern, text, re.IGNORECASE):
+            if not re.search(rf"(?i)({domain}|means|refers to|i\.e\.|in other words)", text):
+                deductions += 1
+                evidence.append(f"- Domain term used without explanation ({domain})")
+                break
+
+    if not any(t in text[:100].lower() for t in ["summary", "tldr", "overview", "in short"]):
+        # No early summary — penalize only if text is long
+        if count_words(text) > 300:
+            deductions += 1
+            evidence.append("- No summary/TLDR in first 100 characters (text is 300+ words)")
+
+    if deductions >= 3:
+        score = 2
+    elif deductions == 2:
+        score = 3
+    elif deductions == 1:
+        score = 4
+
+    if not evidence:
+        evidence.append("+ Well-structured with no clarity issues detected")
+
+    result = AxisScore(name="Clarity", score=score, evidence=evidence)
+    if score < 5:
+        result.improvement = "Add headings, break long paragraphs, define domain terms on first use"
+    return result
+
+
+def check_actionability(text: str) -> AxisScore:
+    """Check if the user can act on the output immediately."""
+    evidence = []
+    score = 5
+    deductions = 0
+
+    # Positive signals
+    actionable_signals = [
+        (r"(?i)(merge|PR|pull request).*?(created|ready|open)", "PR created"),
+        (r"(?i)(run|execute)\s+[`\"']?[\w./-]+", "Specific run command given"),
+        (r"(?i)(next\s+steps?|follow[- ]up|what\s+to\s+do)", "Next steps provided"),
+        (r"(?i)(file\s+(created|written|modified|updated)\s+at)", "File path specified"),
+    ]
+    for pattern, label in actionable_signals:
+        if re.search(pattern, text):
+            evidence.append(f"+ {label}")
+
+    # Negative signals
+    vague_signals = [
+        (r"(?i)(you\s+(should|could|might\s+want\s+to))\s+\w+", "Vague suggestion without specifics"),
+        (r"(?i)(consider|maybe|perhaps)\s+\w+ing", "Non-committal suggestion"),
+        (r"(?i)(figure\s+out|look\s+into|investigate)\s", "Defers work to user"),
+    ]
+    for pattern, label in vague_signals:
+        if re.search(pattern, text):
+            deductions += 1
+            evidence.append(f"- {label}")
+
+    if deductions >= 3:
+        score = 2
+    elif deductions == 2:
+        score = 3
+    elif deductions == 1:
+        score = 4
+
+    if not evidence:
+        evidence.append("No actionability signals — user may need to ask 'what now?'")
+
+    result = AxisScore(name="Actionability", score=score, evidence=evidence)
+    if score < 5:
+        result.improvement = "End with a single clear action: 'Merge this PR', 'Run ./deploy.sh', or 'Review the 3 changed files'"
+    return result
+
+
+def check_concision(text: str, task: Optional[str] = None) -> AxisScore:
+    """Check for redundancy, filler, information density."""
+    evidence = []
+    score = 5
+    wc = count_words(text)
+
+    # Heuristic: task-to-output ratio
+    if task:
+        task_wc = count_words(task)
+        ratio = wc / max(task_wc, 1)
+        if ratio > 15:
+            evidence.append(f"- Output is {ratio:.0f}x longer than task description (high ratio)")
+            score = min(score, 3)
+        elif ratio > 8:
+            evidence.append(f"- Output is {ratio:.0f}x longer than task description")
+            score = min(score, 4)
+
+    # Redundancy signals
+    redundancy_checks = [
+        (r"(?i)(as\s+(I|we)\s+(mentioned|said|noted|discussed)\s+(earlier|above|before))",
+         "Refers back to earlier statement (possible repetition)"),
+        (r"(?i)(to\s+summarize|in\s+summary|in\s+conclusion|to\s+conclude)",
+         "Has explicit summary (good if needed, flag if redundant)"),
+        (r"(?i)(let\s+me\s+(explain|break\s+this\s+down|walk\s+you\s+through))",
+         "Meta-commentary adds words without information"),
+    ]
+    redundant_count = 0
+    for pattern, label in redundancy_checks:
+        matches = re.findall(pattern, text)
+        if len(matches) > 2:
+            redundant_count += 1
+            evidence.append(f"- '{label}' appears {len(matches)} times")
+
+    if redundant_count >= 2:
+        score = min(score, 3)
+    elif redundant_count == 1:
+        score = min(score, 4)
+
+    if not evidence and score == 5:
+        evidence.append("+ No redundancy detected. Information density appears good.")
+
+    result = AxisScore(name="Conciseness", score=score, evidence=evidence)
+    if score < 5:
+        result.improvement = "Cut meta-commentary, remove repeated points, trim examples to one representative case"
+    return result
+
+
+def evaluate(task: Optional[str], output: str) -> list[AxisScore]:
+    """Run all 5 axis checks and return scored results."""
+    return [
+        check_accuracy(output),
+        check_completeness(output, task),
+        check_clarity(output),
+        check_actionability(output),
+        check_concision(output, task),
+    ]
+
+
+def format_report(scores: list[AxisScore]) -> str:
+    """Format scores into a readable evaluation report."""
+    avg = sum(s.score for s in scores) / len(scores)
+    lines = []
+    lines.append("=" * 60)
+    lines.append("AGENT SELF-EVALUATION REPORT")
+    lines.append("=" * 60)
+    lines.append("")
+
+    for s in scores:
+        bar = "█" * s.score + "░" * (5 - s.score)
+        lines.append(f"  {s.name:<15} {bar} {s.score}/5")
+        for e in s.evidence:
+            lines.append(f"    {e}")
+        if s.improvement:
+            lines.append(f"    → {s.improvement}")
+        lines.append("")
+
+    lines.append(f"  {'OVERALL':<15} {avg:.1f}/5")
+    lines.append("")
+
+    # Top improvements
+    improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4]
+    if improvements:
+        lines.append("TOP IMPROVEMENTS (axes scoring < 4):")
+        for s, imp in sorted(improvements, key=lambda x: x[0].score):
+            lines.append(f"  [{s.name}] {imp}")
+    else:
+        lines.append("No axes below 4. Strong output across all dimensions.")
+
+    return "\n".join(lines)
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Evaluate agent output against the 5-axis rubric"
+    )
+    parser.add_argument("--task", help="Task description (file path or inline text)")
+    parser.add_argument("--output", help="Agent output file to evaluate (used only together with --task; otherwise output is read from stdin)")
+    parser.add_argument("--interactive", action="store_true", help="Prompt for task and read output from stdin")
+    args = parser.parse_args()
+
+    task = None
+    output = None
+
+    if args.interactive:
+        task = input("Task description: ").strip()
+        print("Paste agent output (Ctrl+D to finish):")
+        output = sys.stdin.read()
+    elif args.task and args.output:
+        # Read task
+        try:
+            with open(args.task) as f:
+                task = f.read()
+        except FileNotFoundError:
+            task = args.task  # Treat as inline text
+
+        # Read output
+        try:
+            with open(args.output) as f:
+                output = f.read()
+        except FileNotFoundError:
+            print(f"Error: output file '{args.output}' not found", file=sys.stderr)
+            sys.exit(1)
+    else:
+        # Pipe mode: read output from stdin
+        output = sys.stdin.read()
+        if args.task:
+            try:
+                with open(args.task) as f:
+                    task = f.read()
+            except FileNotFoundError:
+                task = args.task
+
+    if not output:
+        print("Error: no output to evaluate", file=sys.stderr)
+        sys.exit(1)
+
+    scores = evaluate(task, output)
+    print(format_report(scores))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/skills/agent-self-evaluation/templates/evaluation-report.md b/skills/agent-self-evaluation/templates/evaluation-report.md
new file mode 100644
index 00000000..ce29f1ce
--- /dev/null
+++ b/skills/agent-self-evaluation/templates/evaluation-report.md
@@ -0,0 +1,70 @@
+# Agent Self-Evaluation Report Template
+
+Copy this template and fill in after completing a task. Keep it in your conversation — the user sees it inline.
+
+```
+============================================================
+AGENT SELF-EVALUATION REPORT
+============================================================
+
+  Accuracy         █████ 5/5    or    ███░░ 3/5
+    + [Evidence: passing tests, verified claims]
+    - [Gaps: unverified claims, hedging language]
+
+  Completeness      █████ 5/5
+    + [What's covered: all requirements + edge cases]
+    - [What's missing: explicitly acknowledge gaps]
+
+  Clarity           █████ 5/5
+    + [Structure: headings, code blocks, bullet points]
+    - [Issues: undefined terms, wall of text, no summary]
+
+  Actionability     █████ 5/5
+    + [User can: merge PR, run command, review file]
+    - [Blockers: missing steps, vague suggestions]
+
+  Conciseness       █████ 5/5
+    + [Tight: no repetition, high information density]
+    - [Bloat: filler, meta-commentary, repeated points]
+
+  OVERALL           X.X/5
+
+TOP IMPROVEMENTS:
+  [Only list axes scoring < 4, ranked by user impact]
+```
+
+## Quick Reference: Scoring Triggers
+
+| If you see this... | Accuracy | Completeness | Clarity | Actionability | Conciseness |
+|---|---|---|---|---|---|
+| "should work" / "probably fine" | ≤4 | — | — | — | — |
+| "I think" / "I believe" | ≤4 | — | — | — | — |
+| No test output cited | ≤4 | — | — | — | — |
+| "TODO" / "FIXME" left behind | ≤3 | ≤3 | — | ≤3 | — |
+| Missing error handling | — | ≤3 | — | — | — |
+| Only happy path covered | — | ≤3 | — | — | — |
+| Wall-of-text paragraph (>200 words) | — | — | ≤3 | — | — |
+| No headings or structure | — | — | ≤3 | — | — |
+| "You should..." without specifics | — | — | — | ≤3 | — |
+| No PR or file created | — | — | — | ≤3 | — |
+| User needs to figure out next step | — | — | — | ≤2 | — |
+| Repeated points (3+ times) | — | — | — | — | ≤3 |
+| "Let me explain..." / "To summarize..." x3+ | — | — | — | — | ≤3 |
+| Output >15x longer than task | — | — | — | — | ≤3 |
+
+## When to Skip
+
+Skip the evaluation if:
+- Task was a single tool call (e.g., "read this file" — nothing to evaluate)
+- User explicitly says "don't evaluate" or "just do it"
+- Task is purely conversational (greeting, small talk)
+- You're mid-workflow and the user will judge the final output, not intermediate steps
+
+## Post-Evaluation Actions
+
+| Overall Score | What to do |
+|---|---|
+| ≥4.5 | Deliver. No changes needed. |
+| 3.5–4.4 | Flag the top improvement but deliver. Fix if <30 seconds. |
+| 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" |
+| <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. |

From d0a84db17781bf963e1c6484e1fc4be48564e0c4 Mon Sep 17 00:00:00 2001
From: Hawthorn <217181565+lamenting-hawthorn@users.noreply.github.com>
Date: Wed, 10 Jun 2026 17:08:31 +0530
Subject: [PATCH 02/10] Update agents/agent-evaluator.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---
 agents/agent-evaluator.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md
index dbb7a904..fba475f7 100644
--- a/agents/agent-evaluator.md
+++ b/agents/agent-evaluator.md
@@ -1,6 +1,6 @@
 ---
 name: agent-evaluator
-description: Evaluates agent output against 5-axis quality rubric (accuracy, completeness, clarity, actionability, concision). Use after any non-trivial task when the user wants a quality assessment, or when the agent-self-evaluation skill is active. Produces structured scorecard with evidence and improvement suggestions.
+description: Evaluates agent output against 5-axis quality rubric (accuracy, completeness, clarity, actionability, conciseness). Use after any non-trivial task when the user wants a quality assessment, or when the agent-self-evaluation skill is active. Produces structured scorecard with evidence and improvement suggestions.
 tools: ["Read", "Grep", "Glob", "Bash"]
 model: sonnet
 ---

From c0f651cf85eacc9064b16e117c0355b307f47721 Mon Sep 17 00:00:00 2001
From: Hawthorn <rv.help23@gmail.com>
Date: Wed, 10 Jun 2026 17:11:36 +0530
Subject: [PATCH 03/10] fix: align report format across evaluate.py, agent
 spec, and template
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- evaluate.py: add CRITICAL ISSUES (axes ≤ 2) section, VERDICT line
- agent-evaluator.md: match format_report output exactly (title, evidence markers, bar graphs)
- templates/evaluation-report.md: match evaluate.py output format
- All now produce identical AGENT SELF-EVALUATION REPORT structure

Single authoritative format: evaluate.py's format_report() output.
---
 agents/agent-evaluator.md                     | 149 ++++++++++++------
 .../agent-self-evaluation/scripts/evaluate.py |  40 ++++-
 .../templates/evaluation-report.md            |  21 ++-
 3 files changed, 147 insertions(+), 63 deletions(-)
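
Note on the VERDICT rules: the branch order added to format_report() in this patch can be checked in isolation. The following is a minimal sketch, assuming a cut-down `AxisScore` with only `name` and `score`; `verdict_for` is a hypothetical helper name used for illustration and is not part of evaluate.py.

```python
from dataclasses import dataclass


@dataclass
class AxisScore:
    # Cut-down stand-in for the AxisScore dataclass in evaluate.py (assumption)
    name: str
    score: int


def verdict_for(scores: list[AxisScore]) -> str:
    """Mirror the verdict branches of format_report() as a pure function."""
    avg = sum(s.score for s in scores) / len(scores)
    weakest = min(scores, key=lambda s: s.score)  # first axis wins ties
    if weakest.score <= 2:
        return f"Redo with specific fixes. Weakest axis: {weakest.name} ({weakest.score}/5)."
    weak = [s.name for s in scores if s.score <= 3]
    if weak:
        return f"Fix {'/'.join(weak)} issues, then deliver."
    if avg >= 4.5:
        return "Deliver as-is. No changes needed."
    return "Deliver as-is. Minor improvements noted above."


AXES = ["Accuracy", "Completeness", "Clarity", "Actionability", "Conciseness"]
strong = [AxisScore(n, s) for n, s in zip(AXES, [5, 4, 5, 5, 4])]
weak_case = [AxisScore(n, s) for n, s in zip(AXES, [2, 3, 4, 2, 3])]
print(verdict_for(strong))
print(verdict_for(weak_case))
```

With the strong example's scores (5, 4, 5, 5, 4; overall 4.6) these rules give "Deliver as-is. No changes needed.", whereas the strong example in agents/agent-evaluator.md shows "Minor improvements noted above." and lists improvements for axes scoring 4. The weak example's verdict matches the rules as written.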

diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md
index fba475f7..f4b90a9b 100644
--- a/agents/agent-evaluator.md
+++ b/agents/agent-evaluator.md
@@ -54,37 +54,49 @@ For each axis:
 
 ### Step 4: Produce Report
 
-Use this format:
+Use this exact format (matches `scripts/evaluate.py` output):
 
 ```
 ============================================================
-AGENT EVALUATION REPORT
+AGENT SELF-EVALUATION REPORT
 ============================================================
 
-  Axis            Score   Evidence
+  Accuracy         █████ 5/5
+    + [Evidence: passing tests, verified claims]
+    → [Improvement if score < 5]
 
-  Accuracy         X/5    [What was verified, what was wrong]
-  Completeness     X/5    [What's covered, what's missing]
-  Clarity          X/5    [Structure quality, readability]
-  Actionability    X/5    [Can user act now? What's the next step?]
-  Conciseness      X/5    [Information density, redundancy]
+  Completeness      █████ 5/5
+    + [What's covered]
+    → [Improvement if score < 5]
 
-  OVERALL          X.X/5
+  Clarity           █████ 5/5
+    + [Structure signals]
+    → [Improvement if score < 5]
+
+  Actionability     █████ 5/5
+    + [User can act immediately]
+    → [Improvement if score < 5]
+
+  Conciseness       █████ 5/5
+    + [Information density]
+    → [Improvement if score < 5]
+
+  OVERALL           X.X/5
 
 CRITICAL ISSUES (axes ≤ 2):
-  [If any axis scored 2 or below, list it here with the specific fix needed]
+  [Axis] Score N/5 — specific fix needed
+  (or "None" if no axis ≤ 2)
 
 TOP IMPROVEMENTS:
-  1. [Highest impact fix first]
+  1. [Highest impact fix]
   2. [Second highest]
-  3. [Third highest]
 
 VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
 ```
 
 ## Output Format
 
-Always include the structured report above. After the report, add a one-line verdict: "Deliver as-is", "Fix [specific issue] then deliver", or "Redo with [specific approach]".
+Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT" (not "AGENT EVALUATION REPORT").
 
 ## Examples
 
@@ -93,26 +105,44 @@ Always include the structured report above. After the report, add a one-line ver
 Task: Add retry logic to HTTP client. 3 retries, exponential backoff.
 
 ```
-AGENT EVALUATION REPORT
+============================================================
+AGENT SELF-EVALUATION REPORT
+============================================================
 
-  Accuracy         5/5    grep confirms httpx.Retry used correctly.
-                          Tests pass (42/42). Import verified.
-  Completeness      4/5    All HTTP methods covered. Missing: connection
-                          pool exhaustion handling (minor edge case).
-  Clarity           5/5    Well-structured. Summary, code blocks, bullet
-                          points. 10-second scan tells the full story.
-  Actionability     5/5    Single PR (#423). `pytest -v` cited. Merge is
-                          the only action needed.
-  Conciseness       4/5    250 words. Verification section slightly
-                          verbose — 3 commands could be 1 script.
+  Accuracy         █████ 5/5
+    + Tests passing
+    + grep confirms httpx.Retry used correctly
+    + Import verified
 
-  OVERALL          4.6/5
+  Completeness      ████░ 4/5
+    + All HTTP methods covered
+    + Edge cases documented
+    → Missing: connection pool exhaustion handling (minor edge case)
+
+  Clarity           █████ 5/5
+    + Uses headings for structure
+    + Summary in first 3 lines
+    + Code blocks with language tags
+
+  Actionability     █████ 5/5
+    + PR #423 created
+    + pytest -v cited (42 passed)
+    + Single action: merge PR
+
+  Conciseness       ████░ 4/5
+    + 250 words, high density
+    → Verification section slightly verbose — 3 commands could be 1 script
+
+  OVERALL           4.6/5
+
+CRITICAL ISSUES (axes ≤ 2):
+  None
 
 TOP IMPROVEMENTS:
-  1. Add connection pool exhaustion to edge cases doc
-  2. Consolidate verification commands into a single script
+  1. [Completeness] Add connection pool exhaustion to edge cases doc
+  2. [Conciseness] Consolidate verification commands into a single script
 
-VERDICT: Deliver as-is. The one gap (pool exhaustion) is a P2 edge case.
+VERDICT: Deliver as-is. Minor improvements noted above.
 ```
 
 ### Example: Weak Output
@@ -120,33 +150,48 @@ VERDICT: Deliver as-is. The one gap (pool exhaustion) is a P2 edge case.
 Task: Same as above.
 
 ```
-AGENT EVALUATION REPORT
+============================================================
+AGENT SELF-EVALUATION REPORT
+============================================================
 
-  Accuracy         2/5    CRITICAL: Agent used urllib3.Retry but project
-                          uses httpx. grep proves no urllib3 import exists.
-                          Hedging language: "I think", "probably fine".
-  Completeness      3/5    Only handles 5xx. Missing: 429 rate limiting,
-                          connection timeouts. Agent acknowledges gaps
-                          ("might be edge cases") but doesn't fix them.
-  Clarity           3/5    Code is readable but no explanation of where
-                          to integrate. "Add this somewhere" is vague.
-  Actionability     2/5    No PR, no file created, no test written.
-                          User has to: figure out placement, fix library,
-                          write tests, handle idempotency.
-  Conciseness       3/5    120 words but ~50% is hedging/disclaimers.
-                          Low information density.
+  Accuracy         ██░░░ 2/5
+    + Code block present
+    - Hedged claim without verification ("I think this should work")
+    - Explicitly untested
+    - Speculation without evidence
+    → Cite specific tool outputs (test results, exit codes, grep findings)
 
-  OVERALL          2.6/5
+  Completeness      ███░░ 3/5
+    + Provides code example
+    - Explicit gap acknowledged ("might be edge cases with POST")
+    - Limited scope noted (only 5xx, missing 429 and connection errors)
+    → List what's covered AND what's intentionally excluded
 
-CRITICAL ISSUES:
-  Accuracy: Wrong library. Use httpx.Retry, not urllib3.Retry.
-  Actionability: No deliverable. Create a PR with the changed file + tests.
+  Clarity           ████░ 4/5
+    + Uses code blocks
+    - No integration guidance ("add this somewhere" is vague)
+    → Specify exact file and line where code should be added
+
+  Actionability     ██░░░ 2/5
+    - Defers work to user ("you'll want to test this")
+    - Vague suggestion without specifics
+    → Create a PR with the changed file + tests
+
+  Conciseness       ███░░ 3/5
+    + Short (120 words)
+    - Low information density (~50% hedging/disclaimers)
+    → Cut meta-commentary and filler
+
+  OVERALL           2.8/5
+
+CRITICAL ISSUES (axes ≤ 2):
+  [Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry.
+  [Actionability] Score 2/5 — No deliverable. Create a PR with test file.
 
 TOP IMPROVEMENTS:
-  1. Switch to httpx.Retry — grep the codebase first to confirm the HTTP library
-  2. Create a PR with src/api_client.py + tests/test_api_client.py
-  3. Handle 429, connection errors, and timeout — not just 5xx
+  1. [Accuracy] Switch to httpx.Retry — grep the codebase first
+  2. [Actionability] Create a PR with src/api_client.py + tests
+  3. [Completeness] Handle 429, connection errors, and timeout
 
-VERDICT: Redo with httpx.Retry, full HTTP method coverage, and a test file.
-  Do not deliver until accuracy ≥ 4.
-```
+VERDICT: Redo with specific fixes. Weakest axis: Accuracy (2/5).
+```
\ No newline at end of file
diff --git a/skills/agent-self-evaluation/scripts/evaluate.py b/skills/agent-self-evaluation/scripts/evaluate.py
index 354f5a7a..0446106b 100755
--- a/skills/agent-self-evaluation/scripts/evaluate.py
+++ b/skills/agent-self-evaluation/scripts/evaluate.py
@@ -306,14 +306,40 @@ def format_report(scores: list[AxisScore]) -> str:
     lines.append(f"  {'OVERALL':<15} {avg:.1f}/5")
     lines.append("")
 
-    # Top improvements
-    improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4]
-    if improvements:
-        lines.append("TOP IMPROVEMENTS (axes scoring < 4):")
-        for s, imp in sorted(improvements, key=lambda x: x[0].score):
-            lines.append(f"  [{s.name}] {imp}")
+    # Critical issues (axes ≤ 2)
+    critical = [(s, s.improvement or "No improvement suggested") for s in scores if s.score <= 2]
+    lines.append("CRITICAL ISSUES (axes ≤ 2):")
+    if critical:
+        for s, imp in critical:
+            lines.append(f"  [{s.name}] Score {s.score}/5 — {imp}")
     else:
-        lines.append("No axes below 4. Strong output across all dimensions.")
+        lines.append("  None")
+
+    lines.append("")
+
+    # Top improvements (axes scoring < 4, ranked by impact)
+    improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4]
+    lines.append("TOP IMPROVEMENTS:")
+    if improvements:
+        for i, (s, imp) in enumerate(sorted(improvements, key=lambda x: x[0].score), 1):
+            lines.append(f"  {i}. [{s.name}] {imp}")
+    else:
+        lines.append("  No axes below 4. Strong output across all dimensions.")
+
+    lines.append("")
+
+    # Verdict
+    min_score = min(s.score for s in scores)
+    if min_score <= 2:
+        verdict = f"Redo with specific fixes. Weakest axis: {min(scores, key=lambda s: s.score).name} ({min_score}/5)."
+    elif any(s.score <= 3 for s in scores):
+        weak = [s.name for s in scores if s.score <= 3]
+        verdict = f"Fix {'/'.join(weak)} issues, then deliver."
+    elif avg >= 4.5:
+        verdict = "Deliver as-is. No changes needed."
+    else:
+        verdict = "Deliver as-is. Minor improvements noted above."
+    lines.append(f"VERDICT: {verdict}")
 
     return "\n".join(lines)
 
diff --git a/skills/agent-self-evaluation/templates/evaluation-report.md b/skills/agent-self-evaluation/templates/evaluation-report.md
index ce29f1ce..ee0513e2 100644
--- a/skills/agent-self-evaluation/templates/evaluation-report.md
+++ b/skills/agent-self-evaluation/templates/evaluation-report.md
@@ -1,6 +1,6 @@
 # Agent Self-Evaluation Report Template
 
-Copy this template and fill in after completing a task. Keep it in your conversation — the user sees it inline.
+Copy this template and fill in after completing a task. The format matches `scripts/evaluate.py` output.
 
 ```
 ============================================================
@@ -10,27 +10,40 @@ AGENT SELF-EVALUATION REPORT
   Accuracy         █████ 5/5    or    ███░░ 3/5
     + [Evidence: passing tests, verified claims]
     - [Gaps: unverified claims, hedging language]
+    → [Improvement if score < 5]
 
   Completeness      █████ 5/5
     + [What's covered: all requirements + edge cases]
     - [What's missing: explicitly acknowledge gaps]
+    → [Improvement if score < 5]
 
   Clarity           █████ 5/5
     + [Structure: headings, code blocks, bullet points]
     - [Issues: undefined terms, wall of text, no summary]
+    → [Improvement if score < 5]
 
   Actionability     █████ 5/5
     + [User can: merge PR, run command, review file]
     - [Blockers: missing steps, vague suggestions]
+    → [Improvement if score < 5]
 
   Conciseness       █████ 5/5
     + [Tight: no repetition, high information density]
     - [Bloat: filler, meta-commentary, repeated points]
+    → [Improvement if score < 5]
 
   OVERALL           X.X/5
 
+CRITICAL ISSUES (axes ≤ 2):
+  [Axis] Score N/5 — specific fix needed
+  (or "None" if no axis ≤ 2)
+
 TOP IMPROVEMENTS:
-  [Only list axes scoring < 4, ranked by user impact]
+  1. [Highest impact fix]
+  2. [Second highest]
+  (Only list axes scoring < 4, ranked by user impact)
+
+VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
 ```
 
 ## Quick Reference: Scoring Triggers
@@ -64,7 +77,7 @@ Skip the evaluation if:
 
 | Overall Score | What to do |
 |---|---|
-| ≥4.5 | Deliver. No changes needed. |
-| 3.5–4.4 | Flag the top improvement but deliver. Fix if <30 seconds. |
+| ≥4.5 | Deliver as-is. No changes needed. |
+| 3.5–4.4 | Flag top improvement but deliver. Fix if <30 seconds. |
 | 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" |
 | <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. |

From 2ea4d779a3693db64061ff6d74d587294a1db320 Mon Sep 17 00:00:00 2001
From: Hawthorn <rv.help23@gmail.com>
Date: Wed, 10 Jun 2026 17:25:24 +0530
Subject: [PATCH 04/10] fix: address self-evaluation review comments

- Clarify that agent-evaluator reads skills/agent-self-evaluation/SKILL.md directly
- Standardize on Conciseness terminology, including helper names
- Remove invalid Stop hook matcher and avoid unsupported command-expression matcher examples
- Add explicit hook-integration reference path in SKILL.md
- Add summary and self-check fields to evaluate.py output, template, and agent spec
- Refactor evaluate.py clarity and input parsing helpers
- Remove unused task parameter from check_completeness

Validation:
- python3 -m py_compile skills/agent-self-evaluation/scripts/evaluate.py
- evaluate.py high/low example smoke tests
- node scripts/ci/validate-agents.js
- node scripts/ci/validate-skills.js
- node scripts/ci/validate-hooks.js
- node scripts/ci/validate-no-personal-paths.js
---
 agents/agent-evaluator.md                     |  13 +-
 skills/agent-self-evaluation/SKILL.md         |   2 +-
 .../references/hook-integration.md            |  15 +-
 .../agent-self-evaluation/scripts/evaluate.py | 130 +++++++++---------
 .../templates/evaluation-report.md            |   3 +
 5 files changed, 91 insertions(+), 72 deletions(-)

diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md
index f4b90a9b..3169382e 100644
--- a/agents/agent-evaluator.md
+++ b/agents/agent-evaluator.md
@@ -13,7 +13,7 @@ You are a quality evaluator for AI agent output. Your job is to assess agent res
 - Every score below 5 MUST cite specific evidence from the output
 - Provide concrete, actionable improvement suggestions
 - Maintain objectivity — evaluate the output, not the agent's effort or intent
-- Load the `agent-self-evaluation` skill for the detailed scoring rubric
+- Read `skills/agent-self-evaluation/SKILL.md` for the detailed scoring rubric. Example input is a standard ECC `SKILL.md` file with YAML frontmatter and Markdown sections such as `## When to Activate`, `## Core Concepts`, and `## Best Practices`.
 
 - DO NOT re-perform the original task
 - DO NOT suggest alternative approaches unless the current approach is factually wrong
@@ -60,6 +60,7 @@ Use this exact format (matches `scripts/evaluate.py` output):
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
+Summary: Overall score X.X/5 across 5 quality axes.
 
   Accuracy         █████ 5/5
     + [Evidence: passing tests, verified claims]
@@ -87,6 +88,8 @@ CRITICAL ISSUES (axes ≤ 2):
   [Axis] Score N/5 — specific fix needed
   (or "None" if no axis ≤ 2)
 
+Self-check: Would the user agree with this assessment? [Yes/No + brief justification]
+
 TOP IMPROVEMENTS:
   1. [Highest impact fix]
   2. [Second highest]
@@ -96,7 +99,7 @@ VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
 
 ## Output Format
 
-Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT" (not "AGENT EVALUATION REPORT").
+Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT".
 
 ## Examples
 
@@ -108,6 +111,7 @@ Task: Add retry logic to HTTP client. 3 retries, exponential backoff.
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
+Summary: Overall score X.X/5 across 5 quality axes.
 
   Accuracy         █████ 5/5
     + Tests passing
@@ -138,6 +142,8 @@ AGENT SELF-EVALUATION REPORT
 CRITICAL ISSUES (axes ≤ 2):
   None
 
+Self-check: Would the user agree with this assessment? Yes — the scores cite passing tests, grep verification, and the remaining gaps are minor.
+
 TOP IMPROVEMENTS:
   1. [Completeness] Add connection pool exhaustion to edge cases doc
   2. [Conciseness] Consolidate verification commands into a single script
@@ -153,6 +159,7 @@ Task: Same as above.
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
+Summary: Overall score X.X/5 across 5 quality axes.
 
   Accuracy         ██░░░ 2/5
     + Code block present
@@ -188,6 +195,8 @@ CRITICAL ISSUES (axes ≤ 2):
   [Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry.
   [Actionability] Score 2/5 — No deliverable. Create a PR with test file.
 
+Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable.
+
 TOP IMPROVEMENTS:
   1. [Accuracy] Switch to httpx.Retry — grep the codebase first
   2. [Actionability] Create a PR with src/api_client.py + tests
diff --git a/skills/agent-self-evaluation/SKILL.md b/skills/agent-self-evaluation/SKILL.md
index 0aa3c986..96edc164 100644
--- a/skills/agent-self-evaluation/SKILL.md
+++ b/skills/agent-self-evaluation/SKILL.md
@@ -15,7 +15,7 @@ After completing a complex task, the agent pauses to rate its own output against
 - After a debugging session that involved 3+ attempts
 - After producing a design document, architecture decision, or written analysis
 - When the user asks "how good was that?" or "rate yourself"
-- At the end of any session Stop hook (if configured — see References)
+- At the end of any session Stop hook (if configured — see `references/hook-integration.md`)
 
 ## Core Concepts
 
diff --git a/skills/agent-self-evaluation/references/hook-integration.md b/skills/agent-self-evaluation/references/hook-integration.md
index 78246b37..260de2ca 100644
--- a/skills/agent-self-evaluation/references/hook-integration.md
+++ b/skills/agent-self-evaluation/references/hook-integration.md
@@ -1,13 +1,12 @@
 # Hook Integration for Session-Stop Self-Evaluation
 
-Add this hook to `hooks/hooks.json` to automatically trigger self-evaluation at the end of every session:
+Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session:
 
 ```json
 {
   "hooks": {
     "Stop": [
       {
-        "matcher": "true",
         "hooks": [
           {
             "type": "command",
@@ -21,6 +20,8 @@ Add this hook to `hooks/hooks.json` to automatically trigger self-evaluation at
 }
 ```
 
+`Stop` events do not use a `matcher` field. Keep the hook object limited to `hooks` and metadata such as `description`.
+
 ## Integration with the Python Evaluator
 
 The `scripts/evaluate.py` script can be used as a standalone tool:
@@ -33,25 +34,27 @@ echo "Your agent response here" | python3 skills/agent-self-evaluation/scripts/e
 python3 skills/agent-self-evaluation/scripts/evaluate.py --task task.txt --output response.txt
 ```
 
-To integrate it into hooks, capture the last agent output to a file first, then run the evaluator:
+To integrate it into hooks, capture the last agent output to a file first, then run the evaluator. For lightweight reminders after shell-based verification, use a simple supported matcher string:
 
 ```json
 {
   "PostToolUse": [
     {
-      "matcher": "tool == \"Bash\" && tool_input.command matches \"(test|pytest|npm test|go test)\"",
+      "matcher": "Bash",
       "hooks": [
         {
           "type": "command",
-          "command": "echo '[Self-Eval] Tests completed. Consider running agent-self-evaluation.'"
+          "command": "echo '[Self-Eval] If this command completed verification for a non-trivial task, consider running agent-self-evaluation.'"
         }
       ],
-      "description": "Remind agent to self-evaluate after test runs"
+      "description": "Remind agent to self-evaluate after shell verification"
     }
   ]
 }
 ```
 
+This avoids documenting unsupported command-expression matcher syntax. If your harness supports command-level matcher expressions, prefer a word-boundary regex such as `\b(pytest|npm test|go test)\b` rather than a broad `test` substring.
+
 These hooks are opt-in. Add them to your local `hooks/hooks.json` if you want automated evaluation prompts.
 
 ## Manual Usage (Recommended)
diff --git a/skills/agent-self-evaluation/scripts/evaluate.py b/skills/agent-self-evaluation/scripts/evaluate.py
index 0446106b..566242a1 100755
--- a/skills/agent-self-evaluation/scripts/evaluate.py
+++ b/skills/agent-self-evaluation/scripts/evaluate.py
@@ -83,7 +83,7 @@ def check_accuracy(text: str) -> AxisScore:
     return result
 
 
-def check_completeness(text: str, task: Optional[str] = None) -> AxisScore:
+def check_completeness(text: str) -> AxisScore:
     """Check for requirement coverage, edge cases, error handling."""
     evidence = []
     score = 5
@@ -125,13 +125,36 @@ def check_completeness(text: str, task: Optional[str] = None) -> AxisScore:
     return result
 
 
+def _check_jargon(text: str) -> tuple[int, list[str]]:
+    """Return clarity deductions for unexplained domain jargon."""
+    jargon = [
+        (r"\b(idempotent|race condition|deadlock|thundering herd)\b", "concurrency"),
+        (r"\b(exponential backoff|circuit breaker|bulkhead)\b", "resilience"),
+        (r"\b(ACID|CAP|eventual consistency|linearizability)\b", "database theory"),
+    ]
+    explanation_pattern = r"(?i)({domain}|means|refers to|i\.e\.|in other words)"
+    for pattern, domain in jargon:
+        has_term = re.search(pattern, text, re.IGNORECASE)
+        explains_term = re.search(explanation_pattern.format(domain=domain), text)
+        if has_term and not explains_term:
+            return 1, [f"- Domain term used without explanation ({domain})"]
+    return 0, []
+
+
+def _check_summary(text: str) -> tuple[int, list[str]]:
+    """Return clarity deduction when long output lacks an early summary."""
+    summary_terms = ["summary", "tldr", "overview", "in short"]
+    has_early_summary = any(term in text[:100].lower() for term in summary_terms)
+    if not has_early_summary and count_words(text) > 300:
+        return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"]
+    return 0, []
+
+
 def check_clarity(text: str) -> AxisScore:
     """Check for structure, readability, jargon handling."""
     evidence = []
-    score = 5
     deductions = 0
 
-    # Positive signals
     if re.search(r"^#{1,3}\s+", text, re.MULTILINE):
         evidence.append("+ Uses headings for structure")
     if re.search(r"```", text):
@@ -139,33 +162,16 @@ def check_clarity(text: str) -> AxisScore:
     if re.search(r"^\s*[-*]\s+", text, re.MULTILINE):
         evidence.append("+ Uses bullet points")
 
-    # Negative signals
-    # Wall of text: long paragraph without breaks
-    paragraphs = [p for p in text.split("\n\n") if p.strip()]
-    for p in paragraphs:
-        if count_words(p) > 200:
+    for paragraph in [p for p in text.split("\n\n") if p.strip()]:
+        if count_words(paragraph) > 200:
             deductions += 1
             evidence.append("- Wall-of-text paragraph (>200 words without break)")
             break
 
-    # Jargon without definition
-    jargon = [
-        (r"\b(idempotent|race condition|deadlock|thundering herd)\b", "concurrency"),
-        (r"\b(exponential backoff|circuit breaker|bulkhead)\b", "resilience"),
-        (r"\b(ACID|CAP|eventual consistency|linearizability)\b", "database theory"),
-    ]
-    for pattern, domain in jargon:
-        if re.search(pattern, text, re.IGNORECASE):
-            if not re.search(rf"(?i)({domain}|means|refers to|i\.e\.|in other words)", text):
-                deductions += 1
-                evidence.append(f"- Domain term used without explanation ({domain})")
-                break
-
-    if not any(t in text[:100].lower() for t in ["summary", "tldr", "overview", "in short"]):
-        # No early summary — penalize only if text is long
-        if count_words(text) > 300:
-            deductions += 1
-            evidence.append("- No summary/TLDR in first 100 words (text is 300+ words)")
+    jargon_deductions, jargon_evidence = _check_jargon(text)
+    summary_deductions, summary_evidence = _check_summary(text)
+    deductions += jargon_deductions + summary_deductions
+    evidence.extend(jargon_evidence + summary_evidence)
 
     if deductions >= 3:
         score = 2
@@ -173,6 +179,8 @@ def check_clarity(text: str) -> AxisScore:
         score = 3
     elif deductions == 1:
         score = 4
+    else:
+        score = 5
 
     if not evidence:
         evidence.append("+ Well-structured with no clarity issues detected")
@@ -227,7 +235,7 @@ def check_actionability(text: str) -> AxisScore:
     return result
 
 
-def check_concision(text: str, task: Optional[str] = None) -> AxisScore:
+def check_conciseness(text: str, task: Optional[str] = None) -> AxisScore:
     """Check for redundancy, filler, information density."""
     evidence = []
     score = 5
@@ -278,10 +286,10 @@ def evaluate(task: Optional[str], output: str) -> list[AxisScore]:
     """Run all 5 axis checks and return scored results."""
     return [
         check_accuracy(output),
-        check_completeness(output, task),
+        check_completeness(output),
         check_clarity(output),
         check_actionability(output),
-        check_concision(output, task),
+        check_conciseness(output, task),
     ]
 
 
@@ -292,13 +300,13 @@ def format_report(scores: list[AxisScore]) -> str:
     lines.append("=" * 60)
     lines.append("AGENT SELF-EVALUATION REPORT")
     lines.append("=" * 60)
+    lines.append(f"Summary: Overall score {avg:.1f}/5 across 5 quality axes.")
     lines.append("")
 
     for s in scores:
         bar = "█" * s.score + "░" * (5 - s.score)
         lines.append(f"  {s.name:<15} {bar} {s.score}/5")
-        for e in s.evidence:
-            lines.append(f"    {e}")
+        lines.extend(f"    {e}" for e in s.evidence)
         if s.improvement:
             lines.append(f"    → {s.improvement}")
         lines.append("")
@@ -316,6 +324,8 @@ def format_report(scores: list[AxisScore]) -> str:
         lines.append("  None")
 
     lines.append("")
+    lines.append("Self-check: Would the user agree with this assessment? [Yes/No + brief justification]")
+    lines.append("")
 
     # Top improvements (axes scoring < 4, ranked by impact)
     improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4]
@@ -344,6 +354,31 @@ def format_report(scores: list[AxisScore]) -> str:
     return "\n".join(lines)
 
 
+def _read_file_or_text(path: Optional[str], required: bool = False) -> Optional[str]:
+    """Read a file path or return inline text when allowed."""
+    if path is None:
+        return None
+    try:
+        with open(path) as f:
+            return f.read()
+    except FileNotFoundError:
+        if required:
+            print(f"Error: output file '{path}' not found", file=sys.stderr)
+            sys.exit(1)
+        return path
+
+
+def _read_input(args: argparse.Namespace) -> tuple[Optional[str], str]:
+    """Read task and output for interactive, file, or pipe mode."""
+    if args.interactive:
+        task = input("Task description: ").strip()
+        print("Paste agent output (Ctrl+D to finish):")
+        return task, sys.stdin.read()
+    if args.output:
+        return _read_file_or_text(args.task), _read_file_or_text(args.output, required=True) or ""
+    return _read_file_or_text(args.task), sys.stdin.read()
+
+
 def main():
     parser = argparse.ArgumentParser(
         description="Evaluate agent output against the 5-axis rubric"
@@ -353,38 +388,7 @@ def main():
     parser.add_argument("--interactive", action="store_true", help="Prompt for task and read output from stdin")
     args = parser.parse_args()
 
-    task = None
-    output = None
-
-    if args.interactive:
-        task = input("Task description: ").strip()
-        print("Paste agent output (Ctrl+D to finish):")
-        output = sys.stdin.read()
-    elif args.task and args.output:
-        # Read task
-        try:
-            with open(args.task) as f:
-                task = f.read()
-        except FileNotFoundError:
-            task = args.task  # Treat as inline text
-
-        # Read output
-        try:
-            with open(args.output) as f:
-                output = f.read()
-        except FileNotFoundError:
-            print(f"Error: output file '{args.output}' not found", file=sys.stderr)
-            sys.exit(1)
-    else:
-        # Pipe mode: read output from stdin
-        output = sys.stdin.read()
-        if args.task:
-            try:
-                with open(args.task) as f:
-                    task = f.read()
-            except FileNotFoundError:
-                task = args.task
-
+    task, output = _read_input(args)
     if not output:
         print("Error: no output to evaluate", file=sys.stderr)
         sys.exit(1)
diff --git a/skills/agent-self-evaluation/templates/evaluation-report.md b/skills/agent-self-evaluation/templates/evaluation-report.md
index ee0513e2..46737092 100644
--- a/skills/agent-self-evaluation/templates/evaluation-report.md
+++ b/skills/agent-self-evaluation/templates/evaluation-report.md
@@ -6,6 +6,7 @@ Copy this template and fill in after completing a task. The format matches `scri
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
+Summary: Overall score X.X/5 across 5 quality axes.
 
   Accuracy         █████ 5/5    or    ███░░ 3/5
     + [Evidence: passing tests, verified claims]
@@ -38,6 +39,8 @@ CRITICAL ISSUES (axes ≤ 2):
   [Axis] Score N/5 — specific fix needed
   (or "None" if no axis ≤ 2)
 
+Self-check: Would the user agree with this assessment? [Yes/No + brief justification]
+
 TOP IMPROVEMENTS:
   1. [Highest impact fix]
   2. [Second highest]

From 7c0a0049a87751911ece37a439c7cc3cbe1777f8 Mon Sep 17 00:00:00 2001
From: Hawthorn <rv.help23@gmail.com>
Date: Wed, 10 Jun 2026 17:59:25 +0530
Subject: [PATCH 05/10] fix: address second-round review comments
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Replace httpx.Retry references with correct httpx API usage across all files
  (httpx has no built-in Retry class; use HTTPTransport(retries=...) for connection retries instead)
- Fix _check_summary to check first 100 words (not 100 characters)
- Fix template to only show → improvement arrow for non-5 scores
- Clarify hook documentation: hook echoes reminder, does not run evaluator
- Add return type annotation to main()
- Make required parameter keyword-only in _read_file_or_text
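
For reviewers, a minimal sketch of why the _check_summary change matters. The
helper names below are hypothetical and exist only for this illustration; the
two expressions mirror the old and new lines in evaluate.py.

```python
SUMMARY_TERMS = ["summary", "tldr", "overview", "in short"]


def has_early_summary_chars(text: str, n: int = 100) -> bool:
    """Old behaviour: search only the first n characters."""
    head = text[:n].lower()
    return any(term in head for term in SUMMARY_TERMS)


def has_early_summary_words(text: str, n: int = 100) -> bool:
    """New behaviour: search the first n whitespace-separated words."""
    head = " ".join(text.split()[:n]).lower()
    return any(term in head for term in SUMMARY_TERMS)


# The summary marker is the 31st word but starts at character 150,
# so only the word-based check sees it.
sample = "word " * 30 + "Summary: retry logic added."
print(has_early_summary_chars(sample), has_early_summary_words(sample))
```

With the character slice, a summary that appears after a short preamble is
missed even though it falls well within the first 100 words.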
---
 agents/agent-evaluator.md                     | 22 ++++++++-----------
 skills/agent-self-evaluation/SKILL.md         |  6 ++---
 .../examples/high-score-example.md            |  4 ++--
 .../examples/low-score-example.md             |  6 ++---
 .../references/evaluation-criteria.md         |  4 ++--
 .../references/hook-integration.md            |  2 +-
 .../agent-self-evaluation/scripts/evaluate.py |  6 ++---
 7 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md
index 3169382e..3a22ee93 100644
--- a/agents/agent-evaluator.md
+++ b/agents/agent-evaluator.md
@@ -63,24 +63,20 @@ AGENT SELF-EVALUATION REPORT
 Summary: Overall score X.X/5 across 5 quality axes.
 
   Accuracy         █████ 5/5
-    + [Evidence: passing tests, verified claims]
-    → [Improvement if score < 5]
+    + [Evidence: passing tests, verified claims]  (no → when score = 5)
 
-  Completeness      █████ 5/5
+  Completeness      ████░ 4/5
     + [What's covered]
-    → [Improvement if score < 5]
+    → [Improvement: only shown when score < 5]
 
   Clarity           █████ 5/5
-    + [Structure signals]
-    → [Improvement if score < 5]
+    + [Structure signals]  (no → when score = 5)
 
   Actionability     █████ 5/5
-    + [User can act immediately]
-    → [Improvement if score < 5]
+    + [User can act immediately]  (no → when score = 5)
 
   Conciseness       █████ 5/5
-    + [Information density]
-    → [Improvement if score < 5]
+    + [Information density]  (no → when score = 5)
 
   OVERALL           X.X/5
 
@@ -115,7 +111,7 @@ Summary: Overall score X.X/5 across 5 quality axes.
 
   Accuracy         █████ 5/5
     + Tests passing
-    + grep confirms httpx.Retry used correctly
+    + grep confirms httpx transport configured correctly
     + Import verified
 
   Completeness      ████░ 4/5
@@ -192,13 +188,13 @@ Summary: Overall score X.X/5 across 5 quality axes.
   OVERALL           2.8/5
 
 CRITICAL ISSUES (axes ≤ 2):
-  [Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry.
+  [Accuracy] Score 2/5 — Wrong library. Use httpx, not urllib3.
   [Actionability] Score 2/5 — No deliverable. Create a PR with test file.
 
 Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable.
 
 TOP IMPROVEMENTS:
-  1. [Accuracy] Switch to httpx.Retry — grep the codebase first
+  1. [Accuracy] Switch to httpx — grep the codebase first
   2. [Actionability] Create a PR with src/api_client.py + tests
   3. [Completeness] Handle 429, connection errors, and timeout
 
diff --git a/skills/agent-self-evaluation/SKILL.md b/skills/agent-self-evaluation/SKILL.md
index 96edc164..0e1a2fd6 100644
--- a/skills/agent-self-evaluation/SKILL.md
+++ b/skills/agent-self-evaluation/SKILL.md
@@ -114,7 +114,7 @@ Overall: 4.6 — One gap (timeout handling). Fix before merging.
 Task: Add retry logic to HTTP client
 
 Scorecard:
-  Accuracy:    2 — Used urllib3.Retry which doesn't exist in our
+  Accuracy:    2 — Used urllib3 which doesn't match our
                   httpx-based codebase. Wrong library.
   Completeness: 3 — Works for GET. POST/PUT not handled (user
                   said "all HTTP requests").
@@ -125,7 +125,7 @@ Scorecard:
                   3 places instead of one shared RetryConfig object.
 
 Overall: 2.8 — Wrong library used. Needs httpx rewrite.
-  Fix accuracy first (switch to httpx.Retry), then extend to all
+  Fix accuracy first (switch to httpx), then extend to all
   HTTP methods, then consolidate config.
 ```
 
@@ -171,7 +171,7 @@ The evaluation is about the delivered output, not about re-arguing design decisi
 - **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took.
 - **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap.
 - **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling."
-- **Be specific about what 'fixed' looks like.** "Re-run with httpx.Retry( total=3, backoff_factor=1.0 )" beats "fix the library issue."
+- **Be specific about what 'fixed' looks like.** "Re-run with httpx transport configured for retries" beats "fix the library issue."
 - **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof.
 - **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?"
 
diff --git a/skills/agent-self-evaluation/examples/high-score-example.md b/skills/agent-self-evaluation/examples/high-score-example.md
index a2da55b8..46d04587 100644
--- a/skills/agent-self-evaluation/examples/high-score-example.md
+++ b/skills/agent-self-evaluation/examples/high-score-example.md
@@ -6,10 +6,10 @@
 ## Agent Output
 
 ### Summary
-Added `httpx.Retry` transport to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
+Added `httpx.HTTPTransport` with retry configuration to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
 
 ### Files Changed
-- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.Retry( total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504] )`
+- `src/api_client.py` — Added `RetryTransport` wrapper around `httpx.HTTPTransport(retries=3)`; the wrapper applies exponential backoff and retries on 429/5xx (`HTTPTransport(retries=...)` itself only retries failed connections)
 - `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429)
 
 ### Verification
diff --git a/skills/agent-self-evaluation/examples/low-score-example.md b/skills/agent-self-evaluation/examples/low-score-example.md
index 25047e7a..6fff99f6 100644
--- a/skills/agent-self-evaluation/examples/low-score-example.md
+++ b/skills/agent-self-evaluation/examples/low-score-example.md
@@ -7,7 +7,7 @@
 
 Here's a retry implementation. I think this should work for most cases.
 
-I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically so we just need to configure it.
+I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically.
 
 ```python
 from urllib3.util import Retry
@@ -38,7 +38,7 @@ AGENT SELF-EVALUATION REPORT
     - Untested ("I haven't tested the timeout behavior")
     - Speculation without evidence ("those are probably fine")
     → Wrong library used. Project uses httpx, not urllib3.
-      urllib3.util.Retry is incompatible with httpx transport.
+      urllib3.util.Retry is incompatible with httpx.
 
   Completeness      ███░░ 3/5
     - Explicit gap acknowledged ("might be edge cases with POST")
@@ -70,7 +70,7 @@ AGENT SELF-EVALUATION REPORT
   OVERALL           2.8/5
 
 TOP IMPROVEMENTS (axes scoring < 4):
-  [Accuracy] Switch to httpx.Retry — grep the codebase to confirm the HTTP
+  [Accuracy] Switch to httpx — grep the codebase to confirm the HTTP
     library before writing code.
   [Actionability] Create a PR with the changed file + test file. Run the
     tests. End with "PR #N ready to merge."
diff --git a/skills/agent-self-evaluation/references/evaluation-criteria.md b/skills/agent-self-evaluation/references/evaluation-criteria.md
index faf83e7d..9a352bf1 100644
--- a/skills/agent-self-evaluation/references/evaluation-criteria.md
+++ b/skills/agent-self-evaluation/references/evaluation-criteria.md
@@ -6,8 +6,8 @@ This reference provides concrete scoring anchors for each axis. Use it when you'
 
 | Score | Anchor | Example |
 |---|---|---|
-| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Used `httpx.Retry` — confirmed in httpx docs. All method names verified with grep against codebase. |
-| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (httpx defaults to 1.0s, claimed 0.5s). |
+| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Configured retry via httpx transport — confirmed in httpx docs. All method names verified with grep against codebase. |
+| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (claimed 0.5s, docs say 1.0s). |
 | 3 | One significant factual error, or 3+ minor inaccuracies. | Used `urllib3.Retry` in an httpx codebase. Works in this one case but wrong library. |
 | 2 | Multiple significant errors. Output would fail if followed. | Claimed "add this to package.json" but project uses pyproject.toml. Two other config claims also wrong. |
 | 1 | Fundamentally incorrect. Output contradicts itself or known facts. | Code has syntax errors. API endpoint doesn't exist. Claims a function signature that grep disproves. |
diff --git a/skills/agent-self-evaluation/references/hook-integration.md b/skills/agent-self-evaluation/references/hook-integration.md
index 260de2ca..066556f0 100644
--- a/skills/agent-self-evaluation/references/hook-integration.md
+++ b/skills/agent-self-evaluation/references/hook-integration.md
@@ -1,6 +1,6 @@
 # Hook Integration for Session-Stop Self-Evaluation
 
-Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session:
+Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session (the hook echoes a reminder; it does not run the evaluator automatically):
 
 ```json
 {
diff --git a/skills/agent-self-evaluation/scripts/evaluate.py b/skills/agent-self-evaluation/scripts/evaluate.py
index 566242a1..f560dc98 100755
--- a/skills/agent-self-evaluation/scripts/evaluate.py
+++ b/skills/agent-self-evaluation/scripts/evaluate.py
@@ -144,7 +144,7 @@ def _check_jargon(text: str) -> tuple[int, list[str]]:
 def _check_summary(text: str) -> tuple[int, list[str]]:
     """Return clarity deduction when long output lacks an early summary."""
     summary_terms = ["summary", "tldr", "overview", "in short"]
-    has_early_summary = any(term in text[:100].lower() for term in summary_terms)
+    has_early_summary = any(term in ' '.join(text.split()[:100]).lower() for term in summary_terms)
     if not has_early_summary and count_words(text) > 300:
         return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"]
     return 0, []
@@ -354,7 +354,7 @@ def format_report(scores: list[AxisScore]) -> str:
     return "\n".join(lines)
 
 
-def _read_file_or_text(path: Optional[str], required: bool = False) -> Optional[str]:
+def _read_file_or_text(path: Optional[str], *, required: bool = False) -> Optional[str]:
     """Read a file path or return inline text when allowed."""
     if path is None:
         return None
@@ -379,7 +379,7 @@ def _read_input(args: argparse.Namespace) -> tuple[Optional[str], str]:
     return _read_file_or_text(args.task), sys.stdin.read()
 
 
-def main():
+def main() -> None:
     parser = argparse.ArgumentParser(
         description="Evaluate agent output against the 5-axis rubric"
     )

From 08f66b49095feb034de180576ed9b7aa03ea1537 Mon Sep 17 00:00:00 2001
From: Hawthorn <rv.help23@gmail.com>
Date: Wed, 10 Jun 2026 18:18:58 +0530
Subject: [PATCH 06/10] fix(agents): add Bash tool guardrails to
 agent-evaluator

List allowed read-only commands (grep, cat, ls, find, head, tail, wc, stat,
git log/diff/show) and explicitly forbid destructive commands (rm, mv, chmod,
git push, git commit, sudo, pip/npm install, curl|wget piping to sh). Any
write/delete/remote-push requires explicit user confirmation.
---
 agents/agent-evaluator.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md
index 3a22ee93..b827bf44 100644
--- a/agents/agent-evaluator.md
+++ b/agents/agent-evaluator.md
@@ -20,6 +20,10 @@ You are a quality evaluator for AI agent output. Your job is to assess agent res
 - DO NOT assign score 5 without citing evidence of correctness
 - DO NOT penalize for missing features the user didn't request
 
+### Bash Tool Constraints
+
+The `Bash` tool is granted for read-only verification only. Allowed: `grep`, `cat`, `ls`, `find`, `head`, `tail`, `wc`, `stat`, `git log`, `git diff`, `git show`. Forbidden: `rm`, `mv`, `chmod`, `git push`, `git commit`, `dd`, `mkfs`, `sudo`, `npm install`, `pip install`, `curl … | sh`, `wget … | sh`, and any other command that writes, deletes, or modifies files or pushes to remotes. If a verification requires a forbidden command, state the intent and expected effects, then ask the user for explicit confirmation before running it.
+
 ## Workflow
 
 ### Step 1: Understand the Task

From f65ab491be3b748502566e179c63235bffe878da Mon Sep 17 00:00:00 2001
From: Hawthorn <rv.help23@gmail.com>
Date: Wed, 10 Jun 2026 18:21:12 +0530
Subject: [PATCH 07/10] fix(docs): clarify Stop event matcher is optional, not
 disallowed
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The validator (scripts/ci/validate-hooks.js, lines 182-184) only errors when
matcher is missing for events outside EVENTS_WITHOUT_MATCHER. For Stop (which
is in EVENTS_WITHOUT_MATCHER), matcher is optional: if present it is still
validated for type correctness, and its absence is also accepted.
---
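
Note (not part of the commit): the rule described above can be restated as a
small sketch. This is an illustrative Python rendering, not the code in
scripts/ci/validate-hooks.js. The event names are the ones listed in the doc
change below; the function name, the error strings, and the assumption that a
valid matcher is a string are hypothetical.

```python
# Events for which 'matcher' is optional, per the doc change in this patch.
EVENTS_WITHOUT_MATCHER = frozenset(
    {"Stop", "Notification", "UserPromptSubmit", "SubagentStop"}
)


def matcher_errors(event: str, entry: dict) -> list[str]:
    """Return validation errors for the 'matcher' field of one hook entry."""
    if "matcher" not in entry:
        # Absence is only an error for events that need a matcher.
        if event in EVENTS_WITHOUT_MATCHER:
            return []
        return [f"{event}: missing required 'matcher'"]
    # Presence is allowed for every event, but the type is still checked.
    # Assumption: the expected type is a string.
    if not isinstance(entry["matcher"], str):
        return [f"{event}: 'matcher' must be a string"]
    return []
```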
 skills/agent-self-evaluation/references/hook-integration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/skills/agent-self-evaluation/references/hook-integration.md b/skills/agent-self-evaluation/references/hook-integration.md
index 066556f0..e56455f3 100644
--- a/skills/agent-self-evaluation/references/hook-integration.md
+++ b/skills/agent-self-evaluation/references/hook-integration.md
@@ -20,7 +20,7 @@ Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the
 }
 ```
 
-`Stop` events do not use a `matcher` field. Keep the hook object limited to `hooks` and metadata such as `description`.
+`Stop` events do not require a `matcher` field (it is optional for `Stop`, `Notification`, `UserPromptSubmit`, and `SubagentStop` per `scripts/ci/validate-hooks.js`). If omitted, the hook object only needs `hooks` and metadata such as `description`.
 
 ## Integration with the Python Evaluator
 

From 8d360fb46642f406fe95f67a2b3697fcf0238bed Mon Sep 17 00:00:00 2001
From: Hawthorn <rv.help23@gmail.com>
Date: Wed, 10 Jun 2026 18:27:27 +0530
Subject: [PATCH 08/10] fix: address remaining review nits
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Add top-level hooks wrapper to second JSON example (consistent with hooks.json format)
- Extract hardcoded thresholds as module-level constants (WALL_OF_TEXT_WORDS,
  SUMMARY_CHECK_WORDS, SUMMARY_CHECK_FIRST_N, TASK_OUTPUT_RATIO_HIGH/MEDIUM)

Skipped (not applicable):
- 'Scoring defaults to 5/5' — by design for heuristic fallback; SKILL.md already
  documents pairing with LLM judge for production use
- '--output silently ignored' — already fixed by _read_input refactor (checks
  args.output directly, not elif args.task and args.output)
---
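
Note (not part of the commit): after this change the thresholds are constants,
but the evidence strings in the unchanged context lines still spell out 100,
300 and 200 literally, so tuning a constant would leave the message stale. A
possible follow-up, sketched under the assumption that `count_words` is a
plain whitespace split (its real definition is outside this diff):

```python
SUMMARY_CHECK_WORDS = 300
SUMMARY_CHECK_FIRST_N = 100


def count_words(text: str) -> int:
    """Stand-in for the script's helper; assumed to split on whitespace."""
    return len(text.split())


def check_summary(text: str) -> tuple[int, list[str]]:
    """Return a clarity deduction when long output lacks an early summary."""
    summary_terms = ["summary", "tldr", "overview", "in short"]
    head = " ".join(text.split()[:SUMMARY_CHECK_FIRST_N]).lower()
    has_early_summary = any(term in head for term in summary_terms)
    if not has_early_summary and count_words(text) > SUMMARY_CHECK_WORDS:
        # Build the message from the constants so it cannot drift from them.
        return 1, [
            f"- No summary/TLDR in first {SUMMARY_CHECK_FIRST_N} words "
            f"(text is over {SUMMARY_CHECK_WORDS} words)"
        ]
    return 0, []
```

The same pattern would apply to the wall-of-text message and
WALL_OF_TEXT_WORDS.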
 .../references/hook-integration.md            | 26 ++++++++++---------
 .../agent-self-evaluation/scripts/evaluate.py | 17 ++++++++----
 2 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/skills/agent-self-evaluation/references/hook-integration.md b/skills/agent-self-evaluation/references/hook-integration.md
index e56455f3..2bb3c3ed 100644
--- a/skills/agent-self-evaluation/references/hook-integration.md
+++ b/skills/agent-self-evaluation/references/hook-integration.md
@@ -38,18 +38,20 @@ To integrate it into hooks, capture the last agent output to a file first, then
 
 ```json
 {
-  "PostToolUse": [
-    {
-      "matcher": "Bash",
-      "hooks": [
-        {
-          "type": "command",
-          "command": "echo '[Self-Eval] If this command completed verification for a non-trivial task, consider running agent-self-evaluation.'"
-        }
-      ],
-      "description": "Remind agent to self-evaluate after shell verification"
-    }
-  ]
+  "hooks": {
+    "PostToolUse": [
+      {
+        "matcher": "Bash",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "echo '[Self-Eval] If this command completed verification for a non-trivial task, consider running agent-self-evaluation.'"
+          }
+        ],
+        "description": "Remind agent to self-evaluate after shell verification"
+      }
+    ]
+  }
 }
 ```
 
diff --git a/skills/agent-self-evaluation/scripts/evaluate.py b/skills/agent-self-evaluation/scripts/evaluate.py
index f560dc98..2d129c40 100755
--- a/skills/agent-self-evaluation/scripts/evaluate.py
+++ b/skills/agent-self-evaluation/scripts/evaluate.py
@@ -24,6 +24,13 @@ import sys
 from dataclasses import dataclass, field
 from typing import Optional
 
+# Tunable thresholds for evaluation heuristics
+WALL_OF_TEXT_WORDS = 200
+SUMMARY_CHECK_WORDS = 300
+SUMMARY_CHECK_FIRST_N = 100
+TASK_OUTPUT_RATIO_HIGH = 15
+TASK_OUTPUT_RATIO_MEDIUM = 8
+
 
 @dataclass
 class AxisScore:
@@ -144,8 +151,8 @@ def _check_jargon(text: str) -> tuple[int, list[str]]:
 def _check_summary(text: str) -> tuple[int, list[str]]:
     """Return clarity deduction when long output lacks an early summary."""
     summary_terms = ["summary", "tldr", "overview", "in short"]
-    has_early_summary = any(term in ' '.join(text.split()[:100]).lower() for term in summary_terms)
-    if not has_early_summary and count_words(text) > 300:
+    has_early_summary = any(term in ' '.join(text.split()[:SUMMARY_CHECK_FIRST_N]).lower() for term in summary_terms)
+    if not has_early_summary and count_words(text) > SUMMARY_CHECK_WORDS:
         return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"]
     return 0, []
 
@@ -163,7 +170,7 @@ def check_clarity(text: str) -> AxisScore:
         evidence.append("+ Uses bullet points")
 
     for paragraph in [p for p in text.split("\n\n") if p.strip()]:
-        if count_words(paragraph) > 200:
+        if count_words(paragraph) > WALL_OF_TEXT_WORDS:
             deductions += 1
             evidence.append("- Wall-of-text paragraph (>200 words without break)")
             break
@@ -245,10 +252,10 @@ def check_conciseness(text: str, task: Optional[str] = None) -> AxisScore:
     if task:
         task_wc = count_words(task)
         ratio = wc / max(task_wc, 1)
-        if ratio > 15:
+        if ratio > TASK_OUTPUT_RATIO_HIGH:
             evidence.append(f"- Output is {ratio:.0f}x longer than task description (high ratio)")
             score = min(score, 3)
-        elif ratio > 8:
+        elif ratio > TASK_OUTPUT_RATIO_MEDIUM:
             evidence.append(f"- Output is {ratio:.0f}x longer than task description")
             score = min(score, 4)
 

From 1e679bcb4775c42d6f59e3727176180d04040484 Mon Sep 17 00:00:00 2001
From: Hawthorn <rv.help23@gmail.com>
Date: Wed, 10 Jun 2026 18:30:22 +0530
Subject: [PATCH 09/10] fix(agents): harden git commands against pager-based
 code execution

Git commands (log, diff, show) can execute arbitrary code via:
- core.pager set in repo-local .git/config
- diff.external pointing to an attacker-controlled binary
- filter drivers in .gitattributes

Mitigation: require --no-pager flag, recommend -c core.pager=cat
to disable pager-driven execution. Moved git commands from the
unqualified allowlist to a hardened allowlist with explicit flags.
---
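
Note (not part of the commit): a minimal sketch of building a hardened git
invocation as an argument list. `--no-pager` and `-c` are global git options,
so they sit before the subcommand. The helper name is hypothetical. Adding
`--no-ext-diff` goes beyond what this patch requires; it is included here as
an assumption about how the diff.external vector named above could be
addressed. Filter drivers from .gitattributes are not covered by this sketch.

```python
ALLOWED_GIT_SUBCOMMANDS = ("log", "diff", "show")


def hardened_git_argv(subcommand: str, *args: str) -> list[str]:
    """Build an argv list for a read-only git command with the pager disabled."""
    if subcommand not in ALLOWED_GIT_SUBCOMMANDS:
        raise ValueError(f"git {subcommand} is not on the read-only allowlist")
    return [
        "git",
        "--no-pager",             # global option: never spawn a pager
        "-c", "core.pager=cat",   # global option: override any configured pager
        subcommand,
        "--no-ext-diff",          # assumption: also refuse external diff drivers
        *args,
    ]
```

Passing the list to `subprocess.run(argv, check=True)` without `shell=True`
would also avoid shell interpretation of the arguments.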
 agents/agent-evaluator.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md
index b827bf44..04317118 100644
--- a/agents/agent-evaluator.md
+++ b/agents/agent-evaluator.md
@@ -22,7 +22,7 @@ You are a quality evaluator for AI agent output. Your job is to assess agent res
 
 ### Bash Tool Constraints
 
-The `Bash` tool is granted for read-only verification only. Allowed: `grep`, `cat`, `ls`, `find`, `head`, `tail`, `wc`, `stat`, `git log`, `git diff`, `git show`. Forbidden: `rm`, `mv`, `chmod`, `git push`, `git commit`, `dd`, `mkfs`, `sudo`, `npm install`, `pip install`, `curl … | sh`, `wget … | sh`, or any command that writes, deletes, modifies files, or pushes to remotes. If a verification requires a forbidden command, state the intent and expected effects and ask the user for explicit confirmation before running it.
+The `Bash` tool is granted for read-only verification only. Allowed: `grep`, `cat`, `ls`, `find`, `head`, `tail`, `wc`, `stat`. Allowed with hardening: `git --no-pager log`, `git --no-pager diff`, `git --no-pager show` (`--no-pager` is a global git option and must come before the subcommand; prefer also passing `-c core.pager=cat` in the same position to disable pager-driven code execution via repo-local `.git/config`). Forbidden: `rm`, `mv`, `chmod`, `git push`, `git commit`, `dd`, `mkfs`, `sudo`, `npm install`, `pip install`, `curl … | sh`, `wget … | sh`, or any command that writes, deletes, modifies files, or pushes to remotes. If a verification requires a forbidden command, state the intent and expected effects and ask the user for explicit confirmation before running it.
 
 ## Workflow
 

From 149be89d397ef448acd243f000e3c004eb67c48f Mon Sep 17 00:00:00 2001
From: Hawthorn <rv.help23@gmail.com>
Date: Thu, 11 Jun 2026 17:58:57 +0530
Subject: [PATCH 10/10] fix: address final lint blockers for agent
 self-evaluation

- Replace U+274C cross-mark examples with ASCII FAIL: prefixes
- Ensure agent-evaluator markdown ends with trailing newline
- Replace markdown placeholder underscores with bracketed placeholders to satisfy markdownlint MD037
---
 agents/agent-evaluator.md                              |  2 +-
 skills/agent-self-evaluation/SKILL.md                  | 10 +++++-----
 .../references/evaluation-criteria.md                  |  2 +-
 .../templates/evaluation-report.md                     |  2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md
index 04317118..c44242ba 100644
--- a/agents/agent-evaluator.md
+++ b/agents/agent-evaluator.md
@@ -203,4 +203,4 @@ TOP IMPROVEMENTS:
   3. [Completeness] Handle 429, connection errors, and timeout
 
 VERDICT: Redo with specific fixes. Weakest axis: Accuracy (2/5).
-```
\ No newline at end of file
+```
diff --git a/skills/agent-self-evaluation/SKILL.md b/skills/agent-self-evaluation/SKILL.md
index 0e1a2fd6..4e241380 100644
--- a/skills/agent-self-evaluation/SKILL.md
+++ b/skills/agent-self-evaluation/SKILL.md
@@ -85,7 +85,7 @@ If any axis scored 3 or below:
 
 1. State what you would do differently
 2. If the gap is fixable in < 30 seconds (missing link, unclear phrasing), fix it now
-3. If the gap requires rework, flag it explicitly: "This axis scored __ because __. Re-running with __ would likely raise it to __."
+3. If the gap requires rework, flag it explicitly: "This axis scored [score] because [evidence]. Re-running with [specific fix] would likely raise it to [target score]."
 
 ## Code Examples
 
@@ -134,7 +134,7 @@ Overall: 2.8 — Wrong library used. Needs httpx rewrite.
 ### "Everything is a 5"
 
 ```
-❌ Accuracy:    5 — All good.
+FAIL: Accuracy:    5 — All good.
    Completeness: 5 — Everything covered.
    Clarity:      5 — Clear.
 ```
@@ -144,7 +144,7 @@ No evidence cited. This is self-congratulation, not evaluation. A real 5 require
 ### Over-penalizing for scope creep
 
 ```
-❌ Completeness: 2 — Didn't handle WebSocket connections or
+FAIL: Completeness: 2 — Didn't handle WebSocket connections or
    gRPC streaming (user didn't ask for these)
 ```
 
@@ -153,7 +153,7 @@ Only evaluate against what the user actually requested, not what you could have
 ### Using the evaluation to re-litigate
 
 ```
-❌ "As I said earlier, this approach is wrong. Score: 1"
+FAIL: "As I said earlier, this approach is wrong. Score: 1"
 ```
 
 The evaluation is about the delivered output, not about re-arguing design decisions that were already made. If the approach was wrong, that should have been caught before delivery.
@@ -161,7 +161,7 @@ The evaluation is about the delivered output, not about re-arguing design decisi
 ### Mixing personal preference with objective gaps
 
 ```
-❌ "Score: 3. I don't like Python decorators."
+FAIL: "Score: 3. I don't like Python decorators."
 ```
 
 "Don't like" is not evidence. Cite a concrete readability, testability, or correctness concern, or leave the score at 4+.
diff --git a/skills/agent-self-evaluation/references/evaluation-criteria.md b/skills/agent-self-evaluation/references/evaluation-criteria.md
index 9a352bf1..fbb3cf90 100644
--- a/skills/agent-self-evaluation/references/evaluation-criteria.md
+++ b/skills/agent-self-evaluation/references/evaluation-criteria.md
@@ -56,7 +56,7 @@ This reference provides concrete scoring anchors for each axis. Use it when you'
 
 ### When the user gave unclear instructions
 
-If the user's request was ambiguous, do NOT penalize completeness for not reading minds. Instead, note in the evaluation: "User's request was ambiguous about __. I chose interpretation __. If they meant __, this score would drop to __."
+If the user's request was ambiguous, do NOT penalize completeness for not reading minds. Instead, note in the evaluation: "User's request was ambiguous about [scope]. I chose interpretation [chosen interpretation]. If they meant [alternative interpretation], this score would drop to [score]."
 
 ### When the task is inherently simple
 
diff --git a/skills/agent-self-evaluation/templates/evaluation-report.md b/skills/agent-self-evaluation/templates/evaluation-report.md
index 46737092..bbc06d4b 100644
--- a/skills/agent-self-evaluation/templates/evaluation-report.md
+++ b/skills/agent-self-evaluation/templates/evaluation-report.md
@@ -83,4 +83,4 @@ Skip the evaluation if:
 | ≥4.5 | Deliver as-is. No changes needed. |
 | 3.5–4.4 | Flag top improvement but deliver. Fix if <30 seconds. |
 | 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" |
-| <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. |
+| <2.5 | Don't deliver. Say: "This scored [score] because [evidence]. Let me redo this with [specific fix]." Then redo. |