From bd4594794158d7cad3619fd5a3e15fd3ce5502cc Mon Sep 17 00:00:00 2001 From: Hawthorn Date: Wed, 10 Jun 2026 16:56:18 +0530 Subject: [PATCH 01/10] feat(skills,agents): add agent-self-evaluation skill and agent-evaluator persona Add structured 5-axis self-evaluation framework for agent output quality: - Accuracy, Completeness, Clarity, Actionability, Conciseness - Evidence-based scoring with concrete improvement suggestions - Standalone Python evaluator script with keyword heuristics - Detailed scoring anchors reference guide - High-score and low-score annotated examples - Reusable evaluation report template - Optional hook integration for session-stop evaluation Agent persona (agent-evaluator) provides a dedicated subagent for applying the rubric to agent output with tool-backed verification. All files tested: Python script runs, examples score correctly (high 4.2, low 3.4), frontmatter parses clean, 183 lines (under 500). --- agents/agent-evaluator.md | 152 +++++++ skills/agent-self-evaluation/SKILL.md | 182 +++++++++ .../examples/high-score-example.md | 87 ++++ .../examples/low-score-example.md | 86 ++++ .../references/evaluation-criteria.md | 71 ++++ .../references/hook-integration.md | 59 +++ .../agent-self-evaluation/scripts/evaluate.py | 371 ++++++++++++++++++ .../templates/evaluation-report.md | 70 ++++ 8 files changed, 1078 insertions(+) create mode 100644 agents/agent-evaluator.md create mode 100644 skills/agent-self-evaluation/SKILL.md create mode 100644 skills/agent-self-evaluation/examples/high-score-example.md create mode 100644 skills/agent-self-evaluation/examples/low-score-example.md create mode 100644 skills/agent-self-evaluation/references/evaluation-criteria.md create mode 100644 skills/agent-self-evaluation/references/hook-integration.md create mode 100755 skills/agent-self-evaluation/scripts/evaluate.py create mode 100644 skills/agent-self-evaluation/templates/evaluation-report.md diff --git a/agents/agent-evaluator.md 
b/agents/agent-evaluator.md new file mode 100644 index 00000000..dbb7a904 --- /dev/null +++ b/agents/agent-evaluator.md @@ -0,0 +1,152 @@ +--- +name: agent-evaluator +description: Evaluates agent output against a 5-axis quality rubric (accuracy, completeness, clarity, actionability, conciseness). Use after any non-trivial task when the user wants a quality assessment, or when the agent-self-evaluation skill is active. Produces a structured scorecard with evidence and improvement suggestions. +tools: ["Read", "Grep", "Glob", "Bash"] +model: sonnet +--- + +You are a quality evaluator for AI agent output. Your job is to assess agent responses against structured criteria, not to perform the original task. + +## Your Role + +- Score agent output on 5 axes: Accuracy, Completeness, Clarity, Actionability, Conciseness +- Every score below 5 MUST cite specific evidence from the output +- Provide concrete, actionable improvement suggestions +- Maintain objectivity — evaluate the output, not the agent's effort or intent +- Load the `agent-self-evaluation` skill for the detailed scoring rubric + +- DO NOT re-perform the original task +- DO NOT suggest alternative approaches unless the current approach is factually wrong +- DO NOT assign score 5 without citing evidence of correctness +- DO NOT penalize for missing features the user didn't request + +## Workflow + +### Step 1: Understand the Task + +Read the user's original request and the agent's final output. 
Identify: +- What was explicitly asked for +- What was implicitly expected (standard practices, edge cases) +- What the agent claimed to deliver + +### Step 2: Gather Evidence + +Use tools to verify claims: +- Run `grep` to confirm API names, function signatures, file paths +- Check test output for pass/fail status +- Verify that files the agent claims to have created actually exist +- Cross-reference claims against project conventions (check existing files for patterns) + +### Step 3: Score Each Axis + +Work through the 5 axes from the `agent-self-evaluation` skill: + +1. **Accuracy** — Are claims correct? Grep the codebase to verify. +2. **Completeness** — All requirements covered? List what's there and what's missing. +3. **Clarity** — Well-structured? Check for headings, code blocks, summaries. +4. **Actionability** — Can the user act immediately? Is there a PR, a command, a file? +5. **Conciseness** — No fluff? Check for redundancy, filler, meta-commentary. + +For each axis: +- Assign score 1-5 +- If score < 5, cite the specific gap with evidence (line numbers, grep output, file existence) +- Write a one-sentence improvement + +### Step 4: Produce Report + +Use this format: + +``` +============================================================ +AGENT EVALUATION REPORT +============================================================ + + Axis Score Evidence + + Accuracy X/5 [What was verified, what was wrong] + Completeness X/5 [What's covered, what's missing] + Clarity X/5 [Structure quality, readability] + Actionability X/5 [Can user act now? What's the next step?] + Conciseness X/5 [Information density, redundancy] + + OVERALL X.X/5 + +CRITICAL ISSUES (axes ≤ 2): + [If any axis scored 2 or below, list it here with the specific fix needed] + +TOP IMPROVEMENTS: + 1. [Highest impact fix first] + 2. [Second highest] + 3. 
[Third highest] + +VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch] +``` + +## Output Format + +Always include the structured report above. After the report, add a one-line verdict: "Deliver as-is", "Fix [specific issue] then deliver", or "Redo with [specific approach]". + +## Examples + +### Example: Strong Output + +Task: Add retry logic to HTTP client. 3 retries, exponential backoff. + +``` +AGENT EVALUATION REPORT + + Accuracy 5/5 grep confirms httpx.Retry used correctly. + Tests pass (42/42). Import verified. + Completeness 4/5 All HTTP methods covered. Missing: connection + pool exhaustion handling (minor edge case). + Clarity 5/5 Well-structured. Summary, code blocks, bullet + points. 10-second scan tells the full story. + Actionability 5/5 Single PR (#423). `pytest -v` cited. Merge is + the only action needed. + Conciseness 4/5 250 words. Verification section slightly + verbose — 3 commands could be 1 script. + + OVERALL 4.6/5 + +TOP IMPROVEMENTS: + 1. Add connection pool exhaustion to edge cases doc + 2. Consolidate verification commands into a single script + +VERDICT: Deliver as-is. The one gap (pool exhaustion) is a P2 edge case. +``` + +### Example: Weak Output + +Task: Same as above. + +``` +AGENT EVALUATION REPORT + + Accuracy 2/5 CRITICAL: Agent used urllib3.Retry but project + uses httpx. grep proves no urllib3 import exists. + Hedging language: "I think", "probably fine". + Completeness 3/5 Only handles 5xx. Missing: 429 rate limiting, + connection timeouts. Agent acknowledges gaps + ("might be edge cases") but doesn't fix them. + Clarity 3/5 Code is readable but no explanation of where + to integrate. "Add this somewhere" is vague. + Actionability 2/5 No PR, no file created, no test written. + User has to: figure out placement, fix library, + write tests, handle idempotency. + Conciseness 3/5 120 words but ~50% is hedging/disclaimers. + Low information density. 
+ + OVERALL 2.6/5 + +CRITICAL ISSUES: + Accuracy: Wrong library. Use httpx.Retry, not urllib3.Retry. + Actionability: No deliverable. Create a PR with the changed file + tests. + +TOP IMPROVEMENTS: + 1. Switch to httpx.Retry — grep the codebase first to confirm the HTTP library + 2. Create a PR with src/api_client.py + tests/test_api_client.py + 3. Handle 429, connection errors, and timeout — not just 5xx + +VERDICT: Redo with httpx.Retry, full HTTP method coverage, and a test file. + Do not deliver until accuracy ≥ 4. +``` diff --git a/skills/agent-self-evaluation/SKILL.md b/skills/agent-self-evaluation/SKILL.md new file mode 100644 index 00000000..0aa3c986 --- /dev/null +++ b/skills/agent-self-evaluation/SKILL.md @@ -0,0 +1,182 @@ +--- +name: agent-self-evaluation +description: Use after completing any non-trivial task. The agent self-rates its output on 5 axes — accuracy, completeness, clarity, actionability, conciseness — with concrete evidence per criterion. Produces a structured 1-5 scorecard with specific improvement suggestions. +origin: ECC +--- + +# Agent Self-Evaluation + +After completing a complex task, the agent pauses to rate its own output against a structured 5-axis rubric. This is NOT a pass/fail gate — it's a deliberate reflection step that catches omissions, flags overconfidence, and surfaces areas for improvement before the user has to. + +## When to Activate + +- After writing code that spans 3+ files or 50+ lines +- After completing a multi-step workflow (implement → test → review) +- After a debugging session that involved 3+ attempts +- After producing a design document, architecture decision, or written analysis +- When the user asks "how good was that?" or "rate yourself" +- When a session Stop hook fires (if configured; see `references/hook-integration.md`) + +## Core Concepts + +### The 5 Evaluation Axes + +| Axis | Question | What it catches | +|---|---|---| +| **Accuracy** | Are the facts, claims, and outputs correct? 
| Hallucinations, wrong API names, incorrect syntax, false statements | +| **Completeness** | Did it cover everything the user asked for? | Missed edge cases, unhandled error paths, forgotten requirements, skipped subtasks | +| **Clarity** | Is the explanation understandable and well-structured? | Confusing explanations, jargon without definition, missing context, rambling | +| **Actionability** | Can the user act on the output immediately? | Vague suggestions, missing steps, "you should X" without showing how, no verification path | +| **Conciseness** | Did it use the minimum words/tokens needed? | Redundancy, over-explanation, repeating the user's question verbatim, filler content | + +### Scoring Scale + +``` +5 — Exceptional: no reasonable improvement possible +4 — Good: minor nits only, no substantive gaps +3 — Adequate: meets the request but has a notable weakness on at least one axis +2 — Weak: has a clear gap that affects usability or correctness +1 — Poor: fundamentally misses the request or contains significant errors +``` + +### The Evidence Rule + +Every score below 5 MUST cite specific evidence. A score of 3 cannot just say "could be better" — it must say exactly what is missing or wrong. The mantra: **"Show the gap, don't just name it."** + +## Workflow + +### Step 1: Collect the Raw Material + +Gather what you'll evaluate: + +``` +- The original user request (read back from conversation) +- Your final response/output (the deliverable) +- Any tool outputs that verify correctness (test results, exit codes, lint output) +- Any user feedback received during the task (corrections, "try again", "that's not right") +``` + +### Step 2: Score Each Axis Independently + +Work through the 5 axes one at a time. For each: + +1. Read the axis question +2. Find evidence (or lack of evidence) in the output +3. Assign a score 1-5 +4. 
If score < 5, write a one-sentence improvement note citing the gap + +Do NOT average the scores in your head first and then work backwards. Score each axis fresh. + +### Step 3: Produce the Evaluation Report + +Use the template from `templates/evaluation-report.md`. The report must include: + +``` +- One-line summary +- 5-axis scorecard (score + evidence per axis) +- Overall score (simple average, rounded to 1 decimal) +- 1-3 specific improvements ranked by impact +- Self-check: "Would the user agree with this assessment?" +``` + +### Step 4: Apply the Improvement + +If any axis scored 3 or below: + +1. State what you would do differently +2. If the gap is fixable in < 30 seconds (missing link, unclear phrasing), fix it now +3. If the gap requires rework, flag it explicitly: "This axis scored __ because __. Re-running with __ would likely raise it to __." + +## Code Examples + +### Example: Good Evaluation (Score 4+) + +``` +Task: Add retry logic to HTTP client + +Scorecard: + Accuracy: 5 — All API calls correct. Verified: retries use + exponential backoff. No hallucinated methods. + Completeness: 4 — Covered happy path + 3 error cases. Missing: + timeout handling for hung connections. + Clarity: 5 — Code comments explain backoff formula. + PR description links to incident that motivated this. + Actionability:5 — Single merge. No follow-up tasks. Tests pass. + Conciseness: 4 — 47 lines total. The retry loop could be extracted + into a helper to drop ~8 lines. + +Overall: 4.6 — One gap (timeout handling). Fix before merging. +``` + +### Example: Weak Evaluation (Score 2-3) + +``` +Task: Add retry logic to HTTP client + +Scorecard: + Accuracy: 2 — Used urllib3.Retry which doesn't exist in our + httpx-based codebase. Wrong library. + Completeness: 3 — Works for GET. POST/PUT not handled (user + said "all HTTP requests"). + Clarity: 4 — Code is readable. Good variable names. + Actionability:2 — "Add tests" mentioned but no test file created. 
+ User has to write tests before merging. + Conciseness: 3 — 120 lines. The retry config is duplicated in + 3 places instead of one shared RetryConfig object. + +Overall: 2.8 — Wrong library used. Needs httpx rewrite. + Fix accuracy first (switch to httpx.Retry), then extend to all + HTTP methods, then consolidate config. +``` + +## Anti-Patterns + +### "Everything is a 5" + +``` +❌ Accuracy: 5 — All good. + Completeness: 5 — Everything covered. + Clarity: 5 — Clear. +``` + +No evidence cited. This is self-congratulation, not evaluation. A real 5 requires proving there's nothing to improve. + +### Over-penalizing for scope creep + +``` +❌ Completeness: 2 — Didn't handle WebSocket connections or + gRPC streaming (user didn't ask for these) +``` + +Only evaluate against what the user actually requested, not what you could have additionally built. + +### Using the evaluation to re-litigate + +``` +❌ "As I said earlier, this approach is wrong. Score: 1" +``` + +The evaluation is about the delivered output, not about re-arguing design decisions that were already made. If the approach was wrong, that should have been caught before delivery. + +### Mixing personal preference with objective gaps + +``` +❌ "Score: 3. I don't like Python decorators." +``` + +"Don't like" is not evidence. Cite a concrete readability, testability, or correctness concern, or leave the score at 4+. + +## Best Practices + +- **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took. +- **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap. +- **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling." +- **Be specific about what 'fixed' looks like.** "Re-run with httpx.Retry( total=3, backoff_factor=1.0 )" beats "fix the library issue." +- **Use tool outputs as evidence.** If tests passed, cite them. 
If lint is clean, cite it. Don't guess — grep for the proof. +- **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?" + +## Related Skills + +- `agent-eval` — Head-to-head comparison of different coding agents on benchmark tasks +- `verification-loop` — Systematic verification of outputs against expected results +- `security-review` — Security-focused code review checklist diff --git a/skills/agent-self-evaluation/examples/high-score-example.md b/skills/agent-self-evaluation/examples/high-score-example.md new file mode 100644 index 00000000..a2da55b8 --- /dev/null +++ b/skills/agent-self-evaluation/examples/high-score-example.md @@ -0,0 +1,87 @@ +# High-Score Example: Adding Retry Logic + +## Task +> Add retry logic with exponential backoff to all HTTP requests in the API client. Max 3 retries, initial delay 1s, max delay 30s. + +## Agent Output + +### Summary +Added `httpx.Retry` transport to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge. + +### Files Changed +- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.Retry( total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504] )` +- `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429) + +### Verification +``` +$ pytest tests/test_api_client.py -v +42 passed in 2.34s + +$ grep -r "RetryTransport" src/ +src/api_client.py:23: class RetryTransport(httpx.BaseTransport): +src/api_client.py:45: self._transport = RetryTransport(...) 
+ +$ python -c "from src.api_client import ApiClient; c = ApiClient(); print(type(c._transport))" + +``` + +### Edge Cases Handled +- Connection timeouts (retry with backoff) +- Rate limiting (429 → retry after delay) +- Server errors (5xx → retry) +- Client errors except 429 (no retry — 4xx means the request is bad) +- Idempotency: GET, PUT, DELETE retry; POST does not (configurable) + +### Merge: [PR #423](https://github.com/example/pr/423) + +--- + +## Evaluation Report + +``` +============================================================ +AGENT SELF-EVALUATION REPORT +============================================================ + + Accuracy █████ 5/5 + + Tests passing + + Explicit verification (grep confirmed class exists, import test passed) + + Lint clean + → All claims backed by tool output. No hedging. + + Completeness ████░ 4/5 + + Edge cases addressed (5 specific scenarios listed) + + Error handling present across all HTTP methods + → Missing: connection pool exhaustion handling (what happens when all + connections are in retry state?) — minor gap, not blocking. + + Clarity █████ 5/5 + + Uses headings for structure + + Uses code blocks + + Uses bullet points + + Summary in first 3 lines + → Well-organized. Reader can scan in 10 seconds. + + Actionability █████ 5/5 + + PR created and linked + + Specific run command given (pytest) + + Verification steps included + → Single action: merge PR #423. Everything else is done. + + Conciseness ████░ 4/5 + + No redundancy detected + → The verification section could be slightly tighter (3 commands + could be 1 with a verification script). Minor. + + OVERALL 4.6/5 + +TOP IMPROVEMENTS: + No axes below 4. Strong output across all dimensions. +``` + +### Why This Scores Well + +1. **Accuracy pinned to tool output.** Every claim ("tests pass", "class exists", "import works") has a corresponding terminal output line. No "should work" or "probably fine." +2. 
**Completeness is explicit about what's covered AND what's not.** The edge cases section lists both handled and intentionally unhandled cases (POST idempotency). +3. **Actionability is single-step.** The user only needs to merge one PR. No follow-up tasks, no "then configure X." +4. **Conciseness is tight.** The output is ~250 words. The information density is high — every sentence carries weight. diff --git a/skills/agent-self-evaluation/examples/low-score-example.md b/skills/agent-self-evaluation/examples/low-score-example.md new file mode 100644 index 00000000..25047e7a --- /dev/null +++ b/skills/agent-self-evaluation/examples/low-score-example.md @@ -0,0 +1,86 @@ +# Low-Score Example: Adding Retry Logic + +## Task +> Add retry logic with exponential backoff to all HTTP requests in the API client. Max 3 retries, initial delay 1s, max delay 30s. + +## Agent Output + +Here's a retry implementation. I think this should work for most cases. + +I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically so we just need to configure it. + +```python +from urllib3.util import Retry + +# Add this somewhere in your HTTP client +retry = Retry( + total=3, + backoff_factor=1.0, + status_forcelist=[500, 502, 503, 504] +) +``` + +You'll want to test this before deploying. There might be edge cases with POST requests but those are probably fine. I haven't tested the timeout behavior but it should work. + +Let me know if you need anything else! + +--- + +## Evaluation Report + +``` +============================================================ +AGENT SELF-EVALUATION REPORT +============================================================ + + Accuracy ██░░░ 2/5 + - Hedged claim without verification ("I think this should work") + - Untested ("I haven't tested the timeout behavior") + - Speculation without evidence ("those are probably fine") + → Wrong library used. Project uses httpx, not urllib3. 
+ urllib3.util.Retry is incompatible with httpx transport. + + Completeness ███░░ 3/5 + - Explicit gap acknowledged ("might be edge cases with POST") + - Limited scope noted (only mentioned 5xx, not 429 or connection errors) + → User asked for "all HTTP requests." Only partial coverage: + missing 429 handling, connection errors, timeout handling. + + Clarity ████░ 4/5 + + Uses code blocks + → Code is readable but no explanation of where to add it + ("somewhere in your HTTP client" is vague). + + Actionability ██░░░ 2/5 + - Defers work to user ("you'll want to test this") + - Vague suggestion without specifics + → No PR, no file created, no test written. User has to: + 1. Figure out where to add the code + 2. Fix the library mismatch (httpx not urllib3) + 3. Write tests + 4. Handle POST idempotency + 5. Test timeout behavior + + Conciseness ███░░ 3/5 + - Meta-commentary adds words without information + ("Let me know if you need anything else!") + → 120 words. Low word count but low information density. + Half the text is hedging and disclaimers, not substance. + + OVERALL 2.8/5 + +TOP IMPROVEMENTS (axes scoring < 4): + [Accuracy] Switch to httpx.Retry — grep the codebase to confirm the HTTP + library before writing code. + [Actionability] Create a PR with the changed file + test file. Run the + tests. End with "PR #N ready to merge." + [Completeness] List what's covered AND what's not. If POST retry is + unsafe, say so explicitly with reasoning. +``` + +### Why This Scores Poorly + +1. **Accuracy fails at the most basic level** — wrong library. One `grep httpx src/` would have caught this. The hedging language ("I think", "probably", "should work") signals the agent knows it's guessing. +2. **Not actionable.** The user received a code snippet and a list of things they need to do. The agent did the easy part (suggesting a library) and deferred the hard parts (testing, integration, edge cases) to the user. +3. 
**Completeness gaps are acknowledged but not fixed.** "Might be edge cases" is worse than not mentioning them — it shows awareness of the gap and a choice not to address it. +4. **Information density is low.** 120 words, of which ~60 are hedging/disclaimers/politeness. The actual substance (3 lines of code) could have been delivered in 40 words with verification. diff --git a/skills/agent-self-evaluation/references/evaluation-criteria.md b/skills/agent-self-evaluation/references/evaluation-criteria.md new file mode 100644 index 00000000..faf83e7d --- /dev/null +++ b/skills/agent-self-evaluation/references/evaluation-criteria.md @@ -0,0 +1,71 @@ +# Evaluation Criteria — Detailed Scoring Guide + +This reference provides concrete scoring anchors for each axis. Use it when you're unsure whether a gap merits a 4 vs a 3, or a 2 vs a 1. + +## Accuracy + +| Score | Anchor | Example | +|---|---|---| +| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Used `httpx.Retry` — confirmed in httpx docs. All method names verified with grep against codebase. | +| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (httpx defaults to 1.0s, claimed 0.5s). | +| 3 | One significant factual error, or 3+ minor inaccuracies. | Used `urllib3.Retry` in an httpx codebase. Works in this one case but wrong library. | +| 2 | Multiple significant errors. Output would fail if followed. | Claimed "add this to package.json" but project uses pyproject.toml. Two other config claims also wrong. | +| 1 | Fundamentally incorrect. Output contradicts itself or known facts. | Code has syntax errors. API endpoint doesn't exist. Claims a function signature that grep disproves. | + +## Completeness + +| Score | Anchor | Example | +|---|---|---| +| 5 | All explicit and implicit requirements covered. Edge cases handled. Error paths addressed. | User said "add retry to all HTTP requests." 
GET, POST, PUT, DELETE all covered. Timeout, 429, 5xx all handled. | +| 4 | All explicit requirements covered. One implicit requirement missed. | All HTTP methods covered. Forgot to handle connection timeouts (not mentioned but expected). | +| 3 | One explicit requirement missed, or 2+ implicit gaps. | User said "add logging too." Retry logic added but no logging. | +| 2 | Multiple explicit requirements missed. Output is a partial solution. | Asked for retry + circuit breaker. Only retry implemented. | +| 1 | Misses the core request. Delivers something adjacent to what was asked. | Asked for retry logic. Wrote a health check endpoint instead. | + +## Clarity + +| Score | Anchor | Example | +|---|---|---| +| 5 | Perfectly structured. Jargon explained or avoided. Visual hierarchy helps scanning. No ambiguity. | README with clear sections, code blocks, and a 10-second summary at top. | +| 4 | Generally clear. One section could be better organized or one term undefined. | Good structure but `exponential backoff` used without explanation — assumes the reader knows it. | +| 3 | Understandable after re-reading. Multiple organizational issues or undefined terms. | The explanation circles the point before getting to it. Several terms used before defined. | +| 2 | Confusing in places. Reader would need to ask follow-up questions. | Code works but the PR description doesn't explain why retry was needed or what it fixes. | +| 1 | Unintelligible or contradictory. Reader cannot determine what was done or why. | Output is a wall of text with no structure. Conclusions contradict earlier statements. | + +## Actionability + +| Score | Anchor | Example | +|---|---|---| +| 5 | Single action required. Verification path included. No implicit steps. | "Merge this PR. Tests pass: `42 passed`. Deploy with `./deploy.sh`." | +| 4 | Single action required but verification path is implied, not explicit. | "Merge this PR." (Tests exist but weren't cited. User has to check themselves.) 
| +| 3 | Multiple actions required, or one action with unclear next step. | "Review and merge. Then update the config." (Which config? Where? No link or path.) | +| 2 | User must figure out how to use the output. Missing critical instructions. | Code written but no test file, no run instructions, no PR created. User has to assemble everything. | +| 1 | Output cannot be acted on without significant rework or clarification. | "Here's a design idea." (No code, no file, no PR. User has to start from scratch.) | + +## Conciseness + +| Score | Anchor | Example | +|---|---|---| +| 5 | Every sentence earns its place. No redundancy. Information density is high. | 30 lines that say what 60 lines would. No repeated points. No filler. | +| 4 | Minor redundancy. One paragraph could be tightened. | Good overall but repeats the motivation in both the PR description and code comments. | +| 3 | Noticeable redundancy. 20%+ of content could be removed without loss. | Explains the same concept three times (in summary, body, and conclusion). Verbose examples. | +| 2 | Significantly bloated. 40%+ of content is filler or repetition. | 200 lines for a task that needed 60. Restates the user's question. Includes irrelevant background. | +| 1 | Noise-to-signal ratio is inverted. More filler than substance. | 500-line response to a 2-line question. Most of it is boilerplate, repetition, or irrelevant context. | + +## Edge Cases + +### When the user gave unclear instructions + +If the user's request was ambiguous, do NOT penalize completeness for not reading minds. Instead, note in the evaluation: "User's request was ambiguous about __. I chose interpretation __. If they meant __, this score would drop to __." + +### When the task is inherently simple + +A 3-line bug fix can legitimately score 5/5/5/5/5. The rubric scales with complexity — a simple task done perfectly IS a 5.0. Don't invent gaps to justify lower scores. 
+ +### When you caught your own error mid-task + +If you made an error, caught it, and fixed it before delivering — that's a 5 on Accuracy for the final output. The evaluation is about what the user received, not your internal process. Note the self-correction as evidence of thoroughness, not as a penalty. + +### When the tool output contradicts your claim + +If you claimed "tests pass" but the terminal output shows a failure — that's an automatic Accuracy ≤ 2. Tool output is ground truth. Claims without verification are the most common source of low accuracy scores. diff --git a/skills/agent-self-evaluation/references/hook-integration.md b/skills/agent-self-evaluation/references/hook-integration.md new file mode 100644 index 00000000..78246b37 --- /dev/null +++ b/skills/agent-self-evaluation/references/hook-integration.md @@ -0,0 +1,59 @@ +# Hook Integration for Session-Stop Self-Evaluation + +Add this hook to `hooks/hooks.json` to print a self-evaluation reminder at the end of every session (the hook does not run the evaluation itself): + +```json +{ + "hooks": { + "Stop": [ + { + "matcher": "true", + "hooks": [ + { + "type": "command", + "command": "echo '[Self-Eval] Session complete. 
Consider running agent-self-evaluation to rate your output.'" + } + ], + "description": "Remind agent to self-evaluate at session end" + } + ] + } +} +``` + +## Integration with the Python Evaluator + +The `scripts/evaluate.py` script can be used as a standalone tool: + +```bash +# Pipe agent output directly +echo "Your agent response here" | python3 skills/agent-self-evaluation/scripts/evaluate.py + +# From files +python3 skills/agent-self-evaluation/scripts/evaluate.py --task task.txt --output response.txt +``` + +To integrate it into hooks, capture the last agent output to a file first, then run the evaluator: + +```json +{ + "PostToolUse": [ + { + "matcher": "tool == \"Bash\" && tool_input.command matches \"(test|pytest|npm test|go test)\"", + "hooks": [ + { + "type": "command", + "command": "echo '[Self-Eval] Tests completed. Consider running agent-self-evaluation.'" + } + ], + "description": "Remind agent to self-evaluate after test runs" + } + ] +} +``` + +These hooks are opt-in. Add them to your local `hooks/hooks.json` if you want automated evaluation prompts. + +## Manual Usage (Recommended) + +The most reliable approach is manual invocation — the agent runs self-evaluation as part of its workflow when the `agent-self-evaluation` skill is active, without requiring hook configuration. The skill's "When to Activate" section already covers trigger conditions (multi-file changes, debugging sessions, design documents). diff --git a/skills/agent-self-evaluation/scripts/evaluate.py b/skills/agent-self-evaluation/scripts/evaluate.py new file mode 100755 index 00000000..354f5a7a --- /dev/null +++ b/skills/agent-self-evaluation/scripts/evaluate.py @@ -0,0 +1,371 @@ +#!/usr/bin/env python3 +"""Standalone agent output evaluator using the 5-axis rubric. + +Reads a task description and agent output from stdin or files, +scores each axis, and prints a structured evaluation report. 
+ +Usage: + # Pipe output directly + echo "Task: Add retry logic" | evaluate.py --output response.txt + + # From files + evaluate.py --task task.txt --output response.txt + + # Interactive (reads task from prompt, output from stdin) + evaluate.py --interactive + +The evaluator uses keyword heuristics + structural checks as a first pass. +For production use, pair with an LLM judge for semantic understanding. +""" + +import argparse +import re +import sys +from dataclasses import dataclass, field +from typing import Optional + + +@dataclass +class AxisScore: + name: str + score: int + evidence: list[str] = field(default_factory=list) + improvement: Optional[str] = None + + +def count_words(text: str) -> int: + return len(text.split()) + + +def check_accuracy(text: str) -> AxisScore: + """Check for verifiable claims, tool output references, error signs.""" + evidence = [] + deductions = 0 + score = 5 + + # Positive signals: verified claims + verified_patterns = [ + (r"(?i)(tests?\s+pass|all\s+tests?\s+passing|\d+\s+passed)", "Tests passing"), + (r"(?i)(exit\s+code\s*[:=]?\s*0|exited\s+with\s+0)", "Clean exit code"), + (r"(?i)(lint.*clean|no\s+lint\s+errors|0\s+errors)", "Lint clean"), + (r"(?i)(verified|confirmed|validated)\s+(with|against|using|by)", "Explicit verification"), + (r"(?i)(grep|rg)\s+.*\b(found|matched|returned)", "Grep confirmed"), + ] + for pattern, label in verified_patterns: + if re.search(pattern, text): + evidence.append(f"+ {label}") + + # Negative signals: unverified claims + danger_patterns = [ + (r"(?i)(should\s+work|probably\s+fine|should\s+be\s+ok)", "Hedged claim without verification"), + (r"(?i)(I\s+think|I\s+believe|I\s+assume|might\s+be)", "Speculation without evidence"), + (r"(?i)(untested|not\s+tested|haven'?t\s+tested)", "Explicitly untested"), + (r"(?i)(TODO|FIXME|HACK|WORKAROUND)", "Unresolved TODO/FIXME"), + ] + for pattern, label in danger_patterns: + if re.search(pattern, text): + deductions += 1 + evidence.append(f"- {label}") + 
+ if deductions >= 3: + score = 2 + elif deductions == 2: + score = 3 + elif deductions == 1: + score = 4 + + if not evidence: + evidence.append("No verification signals detected — score assumes correctness") + + result = AxisScore(name="Accuracy", score=score, evidence=evidence) + if score < 5: + result.improvement = "Cite specific tool outputs (test results, exit codes, grep findings) to back claims" + return result + + +def check_completeness(text: str, task: Optional[str] = None) -> AxisScore: + """Check for requirement coverage, edge cases, error handling.""" + evidence = [] + score = 5 + + # Positive signals + completeness_signals = [ + (r"(?i)(edge\s*cases?|corner\s*cases?)", "Edge cases addressed"), + (r"(?i)(error\s*handling|exception\s*handling|try/except|try\s*{)", "Error handling present"), + (r"(?i)(all\s+\w+\s+(methods|endpoints|routes))", "Full coverage claimed"), + (r"(?i)(verification|verified\s+that|confirmed\s+that)", "Verification step present"), + ] + for pattern, label in completeness_signals: + if re.search(pattern, text): + evidence.append(f"+ {label}") + + # Gaps + gap_signals = [ + (r"(?i)(not\s+covered|not\s+handled|out\s+of\s+scope)", "Explicit gap acknowledged"), + (r"(?i)(only\s+(works|handles|supports)\s+\w+)", "Limited scope noted"), + (r"(?i)(assume[sd]?\s+that|assuming\s+the)", "Assumption without verification"), + ] + deductions = 0 + for pattern, label in gap_signals: + if re.search(pattern, text): + deductions += 1 + evidence.append(f"- {label}") + + if deductions >= 2: + score = 3 + elif deductions == 1: + score = 4 + + if not evidence: + evidence.append("No completeness signals — unable to assess coverage") + + result = AxisScore(name="Completeness", score=score, evidence=evidence) + if score < 5: + result.improvement = "List what was covered AND what was intentionally excluded, with reasoning" + return result + + +def check_clarity(text: str) -> AxisScore: + """Check for structure, readability, jargon handling.""" + evidence 
= [] + score = 5 + deductions = 0 + + # Positive signals + if re.search(r"^#{1,3}\s+", text, re.MULTILINE): + evidence.append("+ Uses headings for structure") + if re.search(r"```", text): + evidence.append("+ Uses code blocks") + if re.search(r"^\s*[-*]\s+", text, re.MULTILINE): + evidence.append("+ Uses bullet points") + + # Negative signals + # Wall of text: long paragraph without breaks + paragraphs = [p for p in text.split("\n\n") if p.strip()] + for p in paragraphs: + if count_words(p) > 200: + deductions += 1 + evidence.append("- Wall-of-text paragraph (>200 words without break)") + break + + # Jargon without definition + jargon = [ + (r"\b(idempotent|race condition|deadlock|thundering herd)\b", "concurrency"), + (r"\b(exponential backoff|circuit breaker|bulkhead)\b", "resilience"), + (r"\b(ACID|CAP|eventual consistency|linearizability)\b", "database theory"), + ] + for pattern, domain in jargon: + if re.search(pattern, text, re.IGNORECASE): + if not re.search(rf"(?i)({domain}|means|refers to|i\.e\.|in other words)", text): + deductions += 1 + evidence.append(f"- Domain term used without explanation ({domain})") + break + + if not any(t in text[:100].lower() for t in ["summary", "tldr", "overview", "in short"]): + # No early summary — penalize only if text is long + if count_words(text) > 300: + deductions += 1 + evidence.append("- No summary/TLDR in first 100 words (text is 300+ words)") + + if deductions >= 3: + score = 2 + elif deductions == 2: + score = 3 + elif deductions == 1: + score = 4 + + if not evidence: + evidence.append("+ Well-structured with no clarity issues detected") + + result = AxisScore(name="Clarity", score=score, evidence=evidence) + if score < 5: + result.improvement = "Add headings, break long paragraphs, define domain terms on first use" + return result + + +def check_actionability(text: str) -> AxisScore: + """Check if the user can act on the output immediately.""" + evidence = [] + score = 5 + deductions = 0 + + # Positive signals 
+ actionable_signals = [ + (r"(?i)(merge|PR|pull request).*?(created|ready|open)", "PR created"), + (r"(?i)(run|execute)\s+[`\"']?[\w./-]+", "Specific run command given"), + (r"(?i)(next\s+steps?|follow[- ]up|what\s+to\s+do)", "Next steps provided"), + (r"(?i)(file\s+(created|written|modified|updated)\s+at)", "File path specified"), + ] + for pattern, label in actionable_signals: + if re.search(pattern, text): + evidence.append(f"+ {label}") + + # Negative signals + vague_signals = [ + (r"(?i)(you\s+(should|could|might\s+want\s+to))\s+\w+", "Vague suggestion without specifics"), + (r"(?i)(consider|maybe|perhaps)\s+\w+ing", "Non-committal suggestion"), + (r"(?i)(figure\s+out|look\s+into|investigate)\s", "Defers work to user"), + ] + for pattern, label in vague_signals: + if re.search(pattern, text): + deductions += 1 + evidence.append(f"- {label}") + + if deductions >= 3: + score = 2 + elif deductions == 2: + score = 3 + elif deductions == 1: + score = 4 + + if not evidence: + evidence.append("No actionability signals — user may need to ask 'what now?'") + + result = AxisScore(name="Actionability", score=score, evidence=evidence) + if score < 5: + result.improvement = "End with a single clear action: 'Merge this PR', 'Run ./deploy.sh', or 'Review the 3 changed files'" + return result + + +def check_concision(text: str, task: Optional[str] = None) -> AxisScore: + """Check for redundancy, filler, information density.""" + evidence = [] + score = 5 + wc = count_words(text) + + # Heuristic: task-to-output ratio + if task: + task_wc = count_words(task) + ratio = wc / max(task_wc, 1) + if ratio > 15: + evidence.append(f"- Output is {ratio:.0f}x longer than task description (high ratio)") + score = min(score, 3) + elif ratio > 8: + evidence.append(f"- Output is {ratio:.0f}x longer than task description") + score = min(score, 4) + + # Redundancy signals + redundancy_checks = [ + (r"(?i)(as\s+(I|we)\s+(mentioned|said|noted|discussed)\s+(earlier|above|before))", + "Refers 
back to earlier statement (possible repetition)"), + (r"(?i)(to\s+summarize|in\s+summary|in\s+conclusion|to\s+conclude)", + "Has explicit summary (good if needed, flag if redundant)"), + (r"(?i)(let\s+me\s+(explain|break\s+this\s+down|walk\s+you\s+through))", + "Meta-commentary adds words without information"), + ] + redundant_count = 0 + for pattern, label in redundancy_checks: + matches = re.findall(pattern, text) + if len(matches) > 2: + redundant_count += 1 + evidence.append(f"- '{label}' appears {len(matches)} times") + + if redundant_count >= 2: + score = min(score, 3) + elif redundant_count == 1: + score = min(score, 4) + + if not evidence and score == 5: + evidence.append("+ No redundancy detected. Information density appears good.") + + result = AxisScore(name="Conciseness", score=score, evidence=evidence) + if score < 5: + result.improvement = "Cut meta-commentary, remove repeated points, trim examples to one representative case" + return result + + +def evaluate(task: Optional[str], output: str) -> list[AxisScore]: + """Run all 5 axis checks and return scored results.""" + return [ + check_accuracy(output), + check_completeness(output, task), + check_clarity(output), + check_actionability(output), + check_concision(output, task), + ] + + +def format_report(scores: list[AxisScore]) -> str: + """Format scores into a readable evaluation report.""" + avg = sum(s.score for s in scores) / len(scores) + lines = [] + lines.append("=" * 60) + lines.append("AGENT SELF-EVALUATION REPORT") + lines.append("=" * 60) + lines.append("") + + for s in scores: + bar = "█" * s.score + "░" * (5 - s.score) + lines.append(f" {s.name:<15} {bar} {s.score}/5") + for e in s.evidence: + lines.append(f" {e}") + if s.improvement: + lines.append(f" → {s.improvement}") + lines.append("") + + lines.append(f" {'OVERALL':<15} {avg:.1f}/5") + lines.append("") + + # Top improvements + improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4] + if improvements: + 
lines.append("TOP IMPROVEMENTS (axes scoring < 4):") + for s, imp in sorted(improvements, key=lambda x: x[0].score): + lines.append(f" [{s.name}] {imp}") + else: + lines.append("No axes below 4. Strong output across all dimensions.") + + return "\n".join(lines) + + +def main(): + parser = argparse.ArgumentParser( + description="Evaluate agent output against the 5-axis rubric" + ) + parser.add_argument("--task", help="Task description (file path or inline text)") + parser.add_argument("--output", help="Agent output to evaluate (file path)") + parser.add_argument("--interactive", action="store_true", help="Prompt for task and read output from stdin") + args = parser.parse_args() + + task = None + output = None + + if args.interactive: + task = input("Task description: ").strip() + print("Paste agent output (Ctrl+D to finish):") + output = sys.stdin.read() + elif args.task and args.output: + # Read task + try: + with open(args.task) as f: + task = f.read() + except FileNotFoundError: + task = args.task # Treat as inline text + + # Read output + try: + with open(args.output) as f: + output = f.read() + except FileNotFoundError: + print(f"Error: output file '{args.output}' not found", file=sys.stderr) + sys.exit(1) + else: + # Pipe mode: read output from stdin + output = sys.stdin.read() + if args.task: + try: + with open(args.task) as f: + task = f.read() + except FileNotFoundError: + task = args.task + + if not output: + print("Error: no output to evaluate", file=sys.stderr) + sys.exit(1) + + scores = evaluate(task, output) + print(format_report(scores)) + + +if __name__ == "__main__": + main() diff --git a/skills/agent-self-evaluation/templates/evaluation-report.md b/skills/agent-self-evaluation/templates/evaluation-report.md new file mode 100644 index 00000000..ce29f1ce --- /dev/null +++ b/skills/agent-self-evaluation/templates/evaluation-report.md @@ -0,0 +1,70 @@ +# Agent Self-Evaluation Report Template + +Copy this template and fill in after completing a task. 
Keep it in your conversation — the user sees it inline. + +``` +============================================================ +AGENT SELF-EVALUATION REPORT +============================================================ + + Accuracy █████ 5/5 or ███░░ 3/5 + + [Evidence: passing tests, verified claims] + - [Gaps: unverified claims, hedging language] + + Completeness █████ 5/5 + + [What's covered: all requirements + edge cases] + - [What's missing: explicitly acknowledge gaps] + + Clarity █████ 5/5 + + [Structure: headings, code blocks, bullet points] + - [Issues: undefined terms, wall of text, no summary] + + Actionability █████ 5/5 + + [User can: merge PR, run command, review file] + - [Blockers: missing steps, vague suggestions] + + Conciseness █████ 5/5 + + [Tight: no repetition, high information density] + - [Bloat: filler, meta-commentary, repeated points] + + OVERALL X.X/5 + +TOP IMPROVEMENTS: + [Only list axes scoring < 4, ranked by user impact] +``` + +## Quick Reference: Scoring Triggers + +| If you see this... | Accuracy | Completeness | Clarity | Actionability | Conciseness | +|---|---|---|---|---|---| +| "should work" / "probably fine" | ≤4 | — | — | — | — | +| "I think" / "I believe" | ≤4 | — | — | — | — | +| No test output cited | ≤4 | — | — | — | — | +| "TODO" / "FIXME" left behind | ≤3 | ≤3 | — | ≤3 | — | +| Missing error handling | — | ≤3 | — | — | — | +| Only happy path covered | — | ≤3 | — | — | — | +| Wall-of-text paragraph (>200 words) | — | — | ≤3 | — | — | +| No headings or structure | — | — | ≤3 | — | — | +| "You should..." without specifics | — | — | — | ≤3 | — | +| No PR or file created | — | — | — | ≤3 | — | +| User needs to figure out next step | — | — | — | ≤2 | — | +| Repeated points (3+ times) | — | — | — | — | ≤3 | +| "Let me explain..." / "To summarize..." 
x3+ | — | — | — | — | ≤3 | +| Output >15x longer than task | — | — | — | — | ≤3 | + +## When to Skip + +Skip the evaluation if: +- Task was a single tool call (e.g., "read this file" — nothing to evaluate) +- User explicitly says "don't evaluate" or "just do it" +- Task is purely conversational (greeting, small talk) +- You're mid-workflow and the user will judge the final output, not intermediate steps + +## Post-Evaluation Actions + +| Overall Score | What to do | +|---|---| +| ≥4.5 | Deliver. No changes needed. | +| 3.5–4.4 | Flag the top improvement but deliver. Fix if <30 seconds. | +| 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" | +| <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. | From d0a84db17781bf963e1c6484e1fc4be48564e0c4 Mon Sep 17 00:00:00 2001 From: Hawthorn <217181565+lamenting-hawthorn@users.noreply.github.com> Date: Wed, 10 Jun 2026 17:08:31 +0530 Subject: [PATCH 02/10] Update agents/agent-evaluator.md Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> --- agents/agent-evaluator.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md index dbb7a904..fba475f7 100644 --- a/agents/agent-evaluator.md +++ b/agents/agent-evaluator.md @@ -1,6 +1,6 @@ --- name: agent-evaluator -description: Evaluates agent output against 5-axis quality rubric (accuracy, completeness, clarity, actionability, concision). Use after any non-trivial task when the user wants a quality assessment, or when the agent-self-evaluation skill is active. Produces structured scorecard with evidence and improvement suggestions. +description: Evaluates agent output against 5-axis quality rubric (accuracy, completeness, clarity, actionability, conciseness). Use after any non-trivial task when the user wants a quality assessment, or when the agent-self-evaluation skill is active. 
Produces structured scorecard with evidence and improvement suggestions. tools: ["Read", "Grep", "Glob", "Bash"] model: sonnet --- From c0f651cf85eacc9064b16e117c0355b307f47721 Mon Sep 17 00:00:00 2001 From: Hawthorn Date: Wed, 10 Jun 2026 17:11:36 +0530 Subject: [PATCH 03/10] fix: align report format across evaluate.py, agent spec, and template MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - evaluate.py: add CRITICAL ISSUES (axes ≤ 2) section, VERDICT line - agent-evaluator.md: match format_report output exactly (title, evidence markers, bar graphs) - templates/evaluation-report.md: match evaluate.py output format - All now produce identical AGENT SELF-EVALUATION REPORT structure Single authoritative format: evaluate.py's format_report() output. --- agents/agent-evaluator.md | 149 ++++++++++++------ .../agent-self-evaluation/scripts/evaluate.py | 40 ++++- .../templates/evaluation-report.md | 21 ++- 3 files changed, 147 insertions(+), 63 deletions(-) diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md index fba475f7..f4b90a9b 100644 --- a/agents/agent-evaluator.md +++ b/agents/agent-evaluator.md @@ -54,37 +54,49 @@ For each axis: ### Step 4: Produce Report -Use this format: +Use this exact format (matches `scripts/evaluate.py` output): ``` ============================================================ -AGENT EVALUATION REPORT +AGENT SELF-EVALUATION REPORT ============================================================ - Axis Score Evidence + Accuracy █████ 5/5 + + [Evidence: passing tests, verified claims] + → [Improvement if score < 5] - Accuracy X/5 [What was verified, what was wrong] - Completeness X/5 [What's covered, what's missing] - Clarity X/5 [Structure quality, readability] - Actionability X/5 [Can user act now? What's the next step?] 
- Conciseness X/5 [Information density, redundancy] + Completeness █████ 5/5 + + [What's covered] + → [Improvement if score < 5] - OVERALL X.X/5 + Clarity █████ 5/5 + + [Structure signals] + → [Improvement if score < 5] + + Actionability █████ 5/5 + + [User can act immediately] + → [Improvement if score < 5] + + Conciseness █████ 5/5 + + [Information density] + → [Improvement if score < 5] + + OVERALL X.X/5 CRITICAL ISSUES (axes ≤ 2): - [If any axis scored 2 or below, list it here with the specific fix needed] + [Axis] Score N/5 — specific fix needed + (or "None" if no axis ≤ 2) TOP IMPROVEMENTS: - 1. [Highest impact fix first] + 1. [Highest impact fix] 2. [Second highest] - 3. [Third highest] VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch] ``` ## Output Format -Always include the structured report above. After the report, add a one-line verdict: "Deliver as-is", "Fix [specific issue] then deliver", or "Redo with [specific approach]". +Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT" (not "AGENT EVALUATION REPORT"). ## Examples @@ -93,26 +105,44 @@ Always include the structured report above. After the report, add a one-line ver Task: Add retry logic to HTTP client. 3 retries, exponential backoff. ``` -AGENT EVALUATION REPORT +============================================================ +AGENT SELF-EVALUATION REPORT +============================================================ - Accuracy 5/5 grep confirms httpx.Retry used correctly. - Tests pass (42/42). Import verified. - Completeness 4/5 All HTTP methods covered. Missing: connection - pool exhaustion handling (minor edge case). - Clarity 5/5 Well-structured. Summary, code blocks, bullet - points. 10-second scan tells the full story. - Actionability 5/5 Single PR (#423). `pytest -v` cited. Merge is - the only action needed. - Conciseness 4/5 250 words. 
Verification section slightly - verbose — 3 commands could be 1 script. + Accuracy █████ 5/5 + + Tests passing + + grep confirms httpx.Retry used correctly + + Import verified - OVERALL 4.6/5 + Completeness ████░ 4/5 + + All HTTP methods covered + + Edge cases documented + → Missing: connection pool exhaustion handling (minor edge case) + + Clarity █████ 5/5 + + Uses headings for structure + + Summary in first 3 lines + + Code blocks with language tags + + Actionability █████ 5/5 + + PR #423 created + + pytest -v cited (42 passed) + + Single action: merge PR + + Conciseness ████░ 4/5 + + 250 words, high density + → Verification section slightly verbose — 3 commands could be 1 script + + OVERALL 4.6/5 + +CRITICAL ISSUES (axes ≤ 2): + None TOP IMPROVEMENTS: - 1. Add connection pool exhaustion to edge cases doc - 2. Consolidate verification commands into a single script + 1. [Completeness] Add connection pool exhaustion to edge cases doc + 2. [Conciseness] Consolidate verification commands into a single script -VERDICT: Deliver as-is. The one gap (pool exhaustion) is a P2 edge case. +VERDICT: Deliver as-is. Minor improvements noted above. ``` ### Example: Weak Output @@ -120,33 +150,48 @@ VERDICT: Deliver as-is. The one gap (pool exhaustion) is a P2 edge case. Task: Same as above. ``` -AGENT EVALUATION REPORT +============================================================ +AGENT SELF-EVALUATION REPORT +============================================================ - Accuracy 2/5 CRITICAL: Agent used urllib3.Retry but project - uses httpx. grep proves no urllib3 import exists. - Hedging language: "I think", "probably fine". - Completeness 3/5 Only handles 5xx. Missing: 429 rate limiting, - connection timeouts. Agent acknowledges gaps - ("might be edge cases") but doesn't fix them. - Clarity 3/5 Code is readable but no explanation of where - to integrate. "Add this somewhere" is vague. - Actionability 2/5 No PR, no file created, no test written. 
- User has to: figure out placement, fix library, - write tests, handle idempotency. - Conciseness 3/5 120 words but ~50% is hedging/disclaimers. - Low information density. + Accuracy ██░░░ 2/5 + + Code block present + - Hedged claim without verification ("I think this should work") + - Explicitly untested + - Speculation without evidence + → Cite specific tool outputs (test results, exit codes, grep findings) - OVERALL 2.6/5 + Completeness ███░░ 3/5 + + Provides code example + - Explicit gap acknowledged ("might be edge cases with POST") + - Limited scope noted (only 5xx, missing 429 and connection errors) + → List what's covered AND what's intentionally excluded -CRITICAL ISSUES: - Accuracy: Wrong library. Use httpx.Retry, not urllib3.Retry. - Actionability: No deliverable. Create a PR with the changed file + tests. + Clarity ████░ 4/5 + + Uses code blocks + - No integration guidance ("add this somewhere" is vague) + → Specify exact file and line where code should be added + + Actionability ██░░░ 2/5 + - Defers work to user ("you'll want to test this") + - Vague suggestion without specifics + → Create a PR with the changed file + tests + + Conciseness ███░░ 3/5 + + Short (120 words) + - Low information density (~50% hedging/disclaimers) + → Cut meta-commentary and filler + + OVERALL 2.8/5 + +CRITICAL ISSUES (axes ≤ 2): + [Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry. + [Actionability] Score 2/5 — No deliverable. Create a PR with test file. TOP IMPROVEMENTS: - 1. Switch to httpx.Retry — grep the codebase first to confirm the HTTP library - 2. Create a PR with src/api_client.py + tests/test_api_client.py - 3. Handle 429, connection errors, and timeout — not just 5xx + 1. [Accuracy] Switch to httpx.Retry — grep the codebase first + 2. [Actionability] Create a PR with src/api_client.py + tests + 3. [Completeness] Handle 429, connection errors, and timeout -VERDICT: Redo with httpx.Retry, full HTTP method coverage, and a test file. 
- Do not deliver until accuracy ≥ 4. -``` +VERDICT: Redo with specific fixes. Weakest axis: Accuracy (2/5). +``` \ No newline at end of file diff --git a/skills/agent-self-evaluation/scripts/evaluate.py b/skills/agent-self-evaluation/scripts/evaluate.py index 354f5a7a..0446106b 100755 --- a/skills/agent-self-evaluation/scripts/evaluate.py +++ b/skills/agent-self-evaluation/scripts/evaluate.py @@ -306,14 +306,40 @@ def format_report(scores: list[AxisScore]) -> str: lines.append(f" {'OVERALL':<15} {avg:.1f}/5") lines.append("") - # Top improvements - improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4] - if improvements: - lines.append("TOP IMPROVEMENTS (axes scoring < 4):") - for s, imp in sorted(improvements, key=lambda x: x[0].score): - lines.append(f" [{s.name}] {imp}") + # Critical issues (axes ≤ 2) + critical = [(s, s.improvement or "No improvement suggested") for s in scores if s.score <= 2] + lines.append("CRITICAL ISSUES (axes ≤ 2):") + if critical: + for s, imp in critical: + lines.append(f" [{s.name}] Score {s.score}/5 — {imp}") else: - lines.append("No axes below 4. Strong output across all dimensions.") + lines.append(" None") + + lines.append("") + + # Top improvements (axes scoring < 4, ranked by impact) + improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4] + lines.append("TOP IMPROVEMENTS:") + if improvements: + for i, (s, imp) in enumerate(sorted(improvements, key=lambda x: x[0].score), 1): + lines.append(f" {i}. [{s.name}] {imp}") + else: + lines.append(" No axes below 4. Strong output across all dimensions.") + + lines.append("") + + # Verdict + min_score = min(s.score for s in scores) + if min_score <= 2: + verdict = f"Redo with specific fixes. Weakest axis: {min(scores, key=lambda s: s.score).name} ({min_score}/5)." + elif any(s.score <= 3 for s in scores): + weak = [s.name for s in scores if s.score <= 3] + verdict = f"Fix {'/'.join(weak)} issues, then deliver." 
+ elif avg >= 4.5: + verdict = "Deliver as-is. No changes needed." + else: + verdict = "Deliver as-is. Minor improvements noted above." + lines.append(f"VERDICT: {verdict}") return "\n".join(lines) diff --git a/skills/agent-self-evaluation/templates/evaluation-report.md b/skills/agent-self-evaluation/templates/evaluation-report.md index ce29f1ce..ee0513e2 100644 --- a/skills/agent-self-evaluation/templates/evaluation-report.md +++ b/skills/agent-self-evaluation/templates/evaluation-report.md @@ -1,6 +1,6 @@ # Agent Self-Evaluation Report Template -Copy this template and fill in after completing a task. Keep it in your conversation — the user sees it inline. +Copy this template and fill in after completing a task. The format matches `scripts/evaluate.py` output. ``` ============================================================ @@ -10,27 +10,40 @@ AGENT SELF-EVALUATION REPORT Accuracy █████ 5/5 or ███░░ 3/5 + [Evidence: passing tests, verified claims] - [Gaps: unverified claims, hedging language] + → [Improvement if score < 5] Completeness █████ 5/5 + [What's covered: all requirements + edge cases] - [What's missing: explicitly acknowledge gaps] + → [Improvement if score < 5] Clarity █████ 5/5 + [Structure: headings, code blocks, bullet points] - [Issues: undefined terms, wall of text, no summary] + → [Improvement if score < 5] Actionability █████ 5/5 + [User can: merge PR, run command, review file] - [Blockers: missing steps, vague suggestions] + → [Improvement if score < 5] Conciseness █████ 5/5 + [Tight: no repetition, high information density] - [Bloat: filler, meta-commentary, repeated points] + → [Improvement if score < 5] OVERALL X.X/5 +CRITICAL ISSUES (axes ≤ 2): + [Axis] Score N/5 — specific fix needed + (or "None" if no axis ≤ 2) + TOP IMPROVEMENTS: - [Only list axes scoring < 4, ranked by user impact] + 1. [Highest impact fix] + 2. 
[Second highest] + (Only list axes scoring < 4, ranked by user impact) + +VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch] ``` ## Quick Reference: Scoring Triggers @@ -64,7 +77,7 @@ Skip the evaluation if: | Overall Score | What to do | |---|---| -| ≥4.5 | Deliver. No changes needed. | -| 3.5–4.4 | Flag the top improvement but deliver. Fix if <30 seconds. | +| ≥4.5 | Deliver as-is. No changes needed. | +| 3.5–4.4 | Flag top improvement but deliver. Fix if <30 seconds. | | 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" | | <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. | From 2ea4d779a3693db64061ff6d74d587294a1db320 Mon Sep 17 00:00:00 2001 From: Hawthorn Date: Wed, 10 Jun 2026 17:25:24 +0530 Subject: [PATCH 04/10] fix: address self-evaluation review comments - Clarify that agent-evaluator reads skills/agent-self-evaluation/SKILL.md directly - Standardize on Conciseness terminology, including helper names - Remove invalid Stop hook matcher and avoid unsupported command-expression matcher examples - Add explicit hook-integration reference path in SKILL.md - Add summary and self-check fields to evaluate.py output, template, and agent spec - Refactor evaluate.py clarity and input parsing helpers - Remove unused task parameter from check_completeness Validation: - python3 -m py_compile skills/agent-self-evaluation/scripts/evaluate.py - evaluate.py high/low example smoke tests - node scripts/ci/validate-agents.js - node scripts/ci/validate-skills.js - node scripts/ci/validate-hooks.js - node scripts/ci/validate-no-personal-paths.js --- agents/agent-evaluator.md | 13 +- skills/agent-self-evaluation/SKILL.md | 2 +- .../references/hook-integration.md | 15 +- .../agent-self-evaluation/scripts/evaluate.py | 130 +++++++++--------- .../templates/evaluation-report.md | 3 + 5 files changed, 91 insertions(+), 72 deletions(-) diff --git 
a/agents/agent-evaluator.md b/agents/agent-evaluator.md index f4b90a9b..3169382e 100644 --- a/agents/agent-evaluator.md +++ b/agents/agent-evaluator.md @@ -13,7 +13,7 @@ You are a quality evaluator for AI agent output. Your job is to assess agent res - Every score below 5 MUST cite specific evidence from the output - Provide concrete, actionable improvement suggestions - Maintain objectivity — evaluate the output, not the agent's effort or intent -- Load the `agent-self-evaluation` skill for the detailed scoring rubric +- Read `skills/agent-self-evaluation/SKILL.md` for the detailed scoring rubric. Example input is a standard ECC `SKILL.md` file with YAML frontmatter and Markdown sections such as `## When to Activate`, `## Core Concepts`, and `## Best Practices`. - DO NOT re-perform the original task - DO NOT suggest alternative approaches unless the current approach is factually wrong @@ -60,6 +60,7 @@ Use this exact format (matches `scripts/evaluate.py` output): ============================================================ AGENT SELF-EVALUATION REPORT ============================================================ +Summary: Overall score X.X/5 across 5 quality axes. Accuracy █████ 5/5 + [Evidence: passing tests, verified claims] @@ -87,6 +88,8 @@ CRITICAL ISSUES (axes ≤ 2): [Axis] Score N/5 — specific fix needed (or "None" if no axis ≤ 2) +Self-check: Would the user agree with this assessment? [Yes/No + brief justification] + TOP IMPROVEMENTS: 1. [Highest impact fix] 2. [Second highest] @@ -96,7 +99,7 @@ VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch] ## Output Format -Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT" (not "AGENT EVALUATION REPORT"). +Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT". 
## Examples @@ -108,6 +111,7 @@ Task: Add retry logic to HTTP client. 3 retries, exponential backoff. ============================================================ AGENT SELF-EVALUATION REPORT ============================================================ +Summary: Overall score 4.6/5 across 5 quality axes. Accuracy █████ 5/5 + Tests passing @@ -138,6 +142,8 @@ AGENT SELF-EVALUATION REPORT CRITICAL ISSUES (axes ≤ 2): None +Self-check: Would the user agree with this assessment? Yes — the scores cite passing tests, grep verification, and the remaining gaps are minor. + TOP IMPROVEMENTS: 1. [Completeness] Add connection pool exhaustion to edge cases doc 2. [Conciseness] Consolidate verification commands into a single script @@ -153,6 +159,7 @@ Task: Same as above. ============================================================ AGENT SELF-EVALUATION REPORT ============================================================ +Summary: Overall score 2.8/5 across 5 quality axes. Accuracy ██░░░ 2/5 + Code block present @@ -188,6 +195,8 @@ CRITICAL ISSUES (axes ≤ 2): [Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry. [Actionability] Score 2/5 — No deliverable. Create a PR with test file. +Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable. + TOP IMPROVEMENTS: 1. [Accuracy] Switch to httpx.Retry — grep the codebase first 2. [Actionability] Create a PR with src/api_client.py + tests diff --git a/skills/agent-self-evaluation/SKILL.md b/skills/agent-self-evaluation/SKILL.md index 0aa3c986..96edc164 100644 --- a/skills/agent-self-evaluation/SKILL.md +++ b/skills/agent-self-evaluation/SKILL.md @@ -15,7 +15,7 @@ After completing a complex task, the agent pauses to rate its own output against - After a debugging session that involved 3+ attempts - After producing a design document, architecture decision, or written analysis - When the user asks "how good was that?"
or "rate yourself" -- At the end of any session Stop hook (if configured — see References) +- At the end of any session Stop hook (if configured — see `references/hook-integration.md`) ## Core Concepts diff --git a/skills/agent-self-evaluation/references/hook-integration.md b/skills/agent-self-evaluation/references/hook-integration.md index 78246b37..260de2ca 100644 --- a/skills/agent-self-evaluation/references/hook-integration.md +++ b/skills/agent-self-evaluation/references/hook-integration.md @@ -1,13 +1,12 @@ # Hook Integration for Session-Stop Self-Evaluation -Add this hook to `hooks/hooks.json` to automatically trigger self-evaluation at the end of every session: +Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session: ```json { "hooks": { "Stop": [ { - "matcher": "true", "hooks": [ { "type": "command", @@ -21,6 +20,8 @@ Add this hook to `hooks/hooks.json` to automatically trigger self-evaluation at } ``` +`Stop` events do not use a `matcher` field. Keep the hook object limited to `hooks` and metadata such as `description`. + ## Integration with the Python Evaluator The `scripts/evaluate.py` script can be used as a standalone tool: @@ -33,25 +34,27 @@ echo "Your agent response here" | python3 skills/agent-self-evaluation/scripts/e python3 skills/agent-self-evaluation/scripts/evaluate.py --task task.txt --output response.txt ``` -To integrate it into hooks, capture the last agent output to a file first, then run the evaluator: +To integrate it into hooks, capture the last agent output to a file first, then run the evaluator. For lightweight reminders after shell-based verification, use a simple supported matcher string: ```json { "PostToolUse": [ { - "matcher": "tool == \"Bash\" && tool_input.command matches \"(test|pytest|npm test|go test)\"", + "matcher": "Bash", "hooks": [ { "type": "command", - "command": "echo '[Self-Eval] Tests completed. 
Consider running agent-self-evaluation.'" + "command": "echo '[Self-Eval] If this command completed verification for a non-trivial task, consider running agent-self-evaluation.'" } ], - "description": "Remind agent to self-evaluate after test runs" + "description": "Remind agent to self-evaluate after shell verification" } ] } ``` +This avoids documenting unsupported command-expression matcher syntax. If your harness supports command-level matcher expressions, prefer a word-boundary regex such as `\b(pytest|npm test|go test)\b` rather than a broad `test` substring. + These hooks are opt-in. Add them to your local `hooks/hooks.json` if you want automated evaluation prompts. ## Manual Usage (Recommended) diff --git a/skills/agent-self-evaluation/scripts/evaluate.py b/skills/agent-self-evaluation/scripts/evaluate.py index 0446106b..566242a1 100755 --- a/skills/agent-self-evaluation/scripts/evaluate.py +++ b/skills/agent-self-evaluation/scripts/evaluate.py @@ -83,7 +83,7 @@ def check_accuracy(text: str) -> AxisScore: return result -def check_completeness(text: str, task: Optional[str] = None) -> AxisScore: +def check_completeness(text: str) -> AxisScore: """Check for requirement coverage, edge cases, error handling.""" evidence = [] score = 5 @@ -125,13 +125,36 @@ def check_completeness(text: str, task: Optional[str] = None) -> AxisScore: return result +def _check_jargon(text: str) -> tuple[int, list[str]]: + """Return clarity deductions for unexplained domain jargon.""" + jargon = [ + (r"\b(idempotent|race condition|deadlock|thundering herd)\b", "concurrency"), + (r"\b(exponential backoff|circuit breaker|bulkhead)\b", "resilience"), + (r"\b(ACID|CAP|eventual consistency|linearizability)\b", "database theory"), + ] + explanation_pattern = r"(?i)({domain}|means|refers to|i\.e\.|in other words)" + for pattern, domain in jargon: + has_term = re.search(pattern, text, re.IGNORECASE) + explains_term = re.search(explanation_pattern.format(domain=domain), text) + if has_term 
and not explains_term: + return 1, [f"- Domain term used without explanation ({domain})"] + return 0, [] + + +def _check_summary(text: str) -> tuple[int, list[str]]: + """Return clarity deduction when long output lacks an early summary.""" + summary_terms = ["summary", "tldr", "overview", "in short"] + has_early_summary = any(term in text[:100].lower() for term in summary_terms) + if not has_early_summary and count_words(text) > 300: + return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"] + return 0, [] + + def check_clarity(text: str) -> AxisScore: """Check for structure, readability, jargon handling.""" evidence = [] - score = 5 deductions = 0 - # Positive signals if re.search(r"^#{1,3}\s+", text, re.MULTILINE): evidence.append("+ Uses headings for structure") if re.search(r"```", text): @@ -139,33 +162,16 @@ def check_clarity(text: str) -> AxisScore: if re.search(r"^\s*[-*]\s+", text, re.MULTILINE): evidence.append("+ Uses bullet points") - # Negative signals - # Wall of text: long paragraph without breaks - paragraphs = [p for p in text.split("\n\n") if p.strip()] - for p in paragraphs: - if count_words(p) > 200: + for paragraph in [p for p in text.split("\n\n") if p.strip()]: + if count_words(paragraph) > 200: deductions += 1 evidence.append("- Wall-of-text paragraph (>200 words without break)") break - # Jargon without definition - jargon = [ - (r"\b(idempotent|race condition|deadlock|thundering herd)\b", "concurrency"), - (r"\b(exponential backoff|circuit breaker|bulkhead)\b", "resilience"), - (r"\b(ACID|CAP|eventual consistency|linearizability)\b", "database theory"), - ] - for pattern, domain in jargon: - if re.search(pattern, text, re.IGNORECASE): - if not re.search(rf"(?i)({domain}|means|refers to|i\.e\.|in other words)", text): - deductions += 1 - evidence.append(f"- Domain term used without explanation ({domain})") - break - - if not any(t in text[:100].lower() for t in ["summary", "tldr", "overview", "in short"]): - # No early 
summary — penalize only if text is long - if count_words(text) > 300: - deductions += 1 - evidence.append("- No summary/TLDR in first 100 words (text is 300+ words)") + jargon_deductions, jargon_evidence = _check_jargon(text) + summary_deductions, summary_evidence = _check_summary(text) + deductions += jargon_deductions + summary_deductions + evidence.extend(jargon_evidence + summary_evidence) if deductions >= 3: score = 2 @@ -173,6 +179,8 @@ def check_clarity(text: str) -> AxisScore: score = 3 elif deductions == 1: score = 4 + else: + score = 5 if not evidence: evidence.append("+ Well-structured with no clarity issues detected") @@ -227,7 +235,7 @@ def check_actionability(text: str) -> AxisScore: return result -def check_concision(text: str, task: Optional[str] = None) -> AxisScore: +def check_conciseness(text: str, task: Optional[str] = None) -> AxisScore: """Check for redundancy, filler, information density.""" evidence = [] score = 5 @@ -278,10 +286,10 @@ def evaluate(task: Optional[str], output: str) -> list[AxisScore]: """Run all 5 axis checks and return scored results.""" return [ check_accuracy(output), - check_completeness(output, task), + check_completeness(output), check_clarity(output), check_actionability(output), - check_concision(output, task), + check_conciseness(output, task), ] @@ -292,13 +300,13 @@ def format_report(scores: list[AxisScore]) -> str: lines.append("=" * 60) lines.append("AGENT SELF-EVALUATION REPORT") lines.append("=" * 60) + lines.append(f"Summary: Overall score {avg:.1f}/5 across 5 quality axes.") lines.append("") for s in scores: bar = "█" * s.score + "░" * (5 - s.score) lines.append(f" {s.name:<15} {bar} {s.score}/5") - for e in s.evidence: - lines.append(f" {e}") + lines.extend(f" {e}" for e in s.evidence) if s.improvement: lines.append(f" → {s.improvement}") lines.append("") @@ -316,6 +324,8 @@ def format_report(scores: list[AxisScore]) -> str: lines.append(" None") lines.append("") + lines.append("Self-check: Would the user 
agree with this assessment? [Yes/No + brief justification]") + lines.append("") # Top improvements (axes scoring < 4, ranked by impact) improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4] @@ -344,6 +354,31 @@ def format_report(scores: list[AxisScore]) -> str: return "\n".join(lines) +def _read_file_or_text(path: Optional[str], required: bool = False) -> Optional[str]: + """Read a file path or return inline text when allowed.""" + if path is None: + return None + try: + with open(path) as f: + return f.read() + except FileNotFoundError: + if required: + print(f"Error: output file '{path}' not found", file=sys.stderr) + sys.exit(1) + return path + + +def _read_input(args: argparse.Namespace) -> tuple[Optional[str], str]: + """Read task and output for interactive, file, or pipe mode.""" + if args.interactive: + task = input("Task description: ").strip() + print("Paste agent output (Ctrl+D to finish):") + return task, sys.stdin.read() + if args.output: + return _read_file_or_text(args.task), _read_file_or_text(args.output, required=True) or "" + return _read_file_or_text(args.task), sys.stdin.read() + + def main(): parser = argparse.ArgumentParser( description="Evaluate agent output against the 5-axis rubric" @@ -353,38 +388,7 @@ def main(): parser.add_argument("--interactive", action="store_true", help="Prompt for task and read output from stdin") args = parser.parse_args() - task = None - output = None - - if args.interactive: - task = input("Task description: ").strip() - print("Paste agent output (Ctrl+D to finish):") - output = sys.stdin.read() - elif args.task and args.output: - # Read task - try: - with open(args.task) as f: - task = f.read() - except FileNotFoundError: - task = args.task # Treat as inline text - - # Read output - try: - with open(args.output) as f: - output = f.read() - except FileNotFoundError: - print(f"Error: output file '{args.output}' not found", file=sys.stderr) - sys.exit(1) - else: - # Pipe mode: read 
output from stdin - output = sys.stdin.read() - if args.task: - try: - with open(args.task) as f: - task = f.read() - except FileNotFoundError: - task = args.task - + task, output = _read_input(args) if not output: print("Error: no output to evaluate", file=sys.stderr) sys.exit(1) diff --git a/skills/agent-self-evaluation/templates/evaluation-report.md b/skills/agent-self-evaluation/templates/evaluation-report.md index ee0513e2..46737092 100644 --- a/skills/agent-self-evaluation/templates/evaluation-report.md +++ b/skills/agent-self-evaluation/templates/evaluation-report.md @@ -6,6 +6,7 @@ Copy this template and fill in after completing a task. The format matches `scri ============================================================ AGENT SELF-EVALUATION REPORT ============================================================ +Summary: Overall score X.X/5 across 5 quality axes. Accuracy █████ 5/5 or ███░░ 3/5 + [Evidence: passing tests, verified claims] @@ -38,6 +39,8 @@ CRITICAL ISSUES (axes ≤ 2): [Axis] Score N/5 — specific fix needed (or "None" if no axis ≤ 2) +Self-check: Would the user agree with this assessment? [Yes/No + brief justification] + TOP IMPROVEMENTS: 1. [Highest impact fix] 2. 
[Second highest] From 7c0a0049a87751911ece37a439c7cc3cbe1777f8 Mon Sep 17 00:00:00 2001 From: Hawthorn Date: Wed, 10 Jun 2026 17:59:25 +0530 Subject: [PATCH 05/10] fix: address second-round review comments MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Replace httpx.Retry references with correct httpx API usage across all files (httpx has no built-in Retry class; use HTTPTransport/Limits instead) - Fix _check_summary to check first 100 words (not 100 characters) - Fix template to only show → improvement arrow for non-5 scores - Clarify hook documentation: hook echoes reminder, does not run evaluator - Add return type annotation to main() - Make required parameter keyword-only in _read_file_or_text --- agents/agent-evaluator.md | 22 ++++++++----------- skills/agent-self-evaluation/SKILL.md | 6 ++--- .../examples/high-score-example.md | 4 ++-- .../examples/low-score-example.md | 6 ++--- .../references/evaluation-criteria.md | 4 ++-- .../references/hook-integration.md | 2 +- .../agent-self-evaluation/scripts/evaluate.py | 6 ++--- 7 files changed, 23 insertions(+), 27 deletions(-) diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md index 3169382e..3a22ee93 100644 --- a/agents/agent-evaluator.md +++ b/agents/agent-evaluator.md @@ -63,24 +63,20 @@ AGENT SELF-EVALUATION REPORT Summary: Overall score X.X/5 across 5 quality axes. 
Accuracy █████ 5/5 - + [Evidence: passing tests, verified claims] - → [Improvement if score < 5] + + [Evidence: passing tests, verified claims] (no → when score = 5) - Completeness █████ 5/5 + Completeness ████░ 4/5 + [What's covered] - → [Improvement if score < 5] + → [Improvement: only shown when score < 5] Clarity █████ 5/5 - + [Structure signals] - → [Improvement if score < 5] + + [Structure signals] (no → when score = 5) Actionability █████ 5/5 - + [User can act immediately] - → [Improvement if score < 5] + + [User can act immediately] (no → when score = 5) Conciseness █████ 5/5 - + [Information density] - → [Improvement if score < 5] + + [Information density] (no → when score = 5) OVERALL X.X/5 @@ -115,7 +111,7 @@ Summary: Overall score X.X/5 across 5 quality axes. Accuracy █████ 5/5 + Tests passing - + grep confirms httpx.Retry used correctly + + grep confirms httpx transport configured correctly + Import verified Completeness ████░ 4/5 @@ -192,13 +188,13 @@ Summary: Overall score X.X/5 across 5 quality axes. OVERALL 2.8/5 CRITICAL ISSUES (axes ≤ 2): - [Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry. + [Accuracy] Score 2/5 — Wrong library. Use httpx, not urllib3. [Actionability] Score 2/5 — No deliverable. Create a PR with test file. Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable. TOP IMPROVEMENTS: - 1. [Accuracy] Switch to httpx.Retry — grep the codebase first + 1. [Accuracy] Switch to httpx — grep the codebase first 2. [Actionability] Create a PR with src/api_client.py + tests 3. [Completeness] Handle 429, connection errors, and timeout diff --git a/skills/agent-self-evaluation/SKILL.md b/skills/agent-self-evaluation/SKILL.md index 96edc164..0e1a2fd6 100644 --- a/skills/agent-self-evaluation/SKILL.md +++ b/skills/agent-self-evaluation/SKILL.md @@ -114,7 +114,7 @@ Overall: 4.6 — One gap (timeout handling). Fix before merging. 
Task: Add retry logic to HTTP client Scorecard: - Accuracy: 2 — Used urllib3.Retry which doesn't exist in our + Accuracy: 2 — Used urllib3 which doesn't match our httpx-based codebase. Wrong library. Completeness: 3 — Works for GET. POST/PUT not handled (user said "all HTTP requests"). @@ -125,7 +125,7 @@ Scorecard: 3 places instead of one shared RetryConfig object. Overall: 2.8 — Wrong library used. Needs httpx rewrite. - Fix accuracy first (switch to httpx.Retry), then extend to all + Fix accuracy first (switch to httpx), then extend to all HTTP methods, then consolidate config. ``` @@ -171,7 +171,7 @@ The evaluation is about the delivered output, not about re-arguing design decisi - **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took. - **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap. - **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling." -- **Be specific about what 'fixed' looks like.** "Re-run with httpx.Retry( total=3, backoff_factor=1.0 )" beats "fix the library issue." +- **Be specific about what 'fixed' looks like.** "Re-run with httpx transport configured for retries" beats "fix the library issue." - **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof. - **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?" 
diff --git a/skills/agent-self-evaluation/examples/high-score-example.md b/skills/agent-self-evaluation/examples/high-score-example.md index a2da55b8..46d04587 100644 --- a/skills/agent-self-evaluation/examples/high-score-example.md +++ b/skills/agent-self-evaluation/examples/high-score-example.md @@ -6,10 +6,10 @@ ## Agent Output ### Summary -Added `httpx.Retry` transport to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge. +Added `httpx.HTTPTransport` with retry configuration to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge. ### Files Changed -- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.Retry( total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504] )` +- `src/api_client.py` — Added `RetryTransport` wrapper around `httpx.HTTPTransport(retries=3)` (which retries connection failures only); the wrapper itself retries 429/5xx responses with exponential backoff - `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429) ### Verification diff --git a/skills/agent-self-evaluation/examples/low-score-example.md b/skills/agent-self-evaluation/examples/low-score-example.md index 25047e7a..6fff99f6 100644 --- a/skills/agent-self-evaluation/examples/low-score-example.md +++ b/skills/agent-self-evaluation/examples/low-score-example.md @@ -7,7 +7,7 @@ Here's a retry implementation. I think this should work for most cases. -I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically so we just need to configure it. +I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically.
```python from urllib3.util import Retry @@ -38,7 +38,7 @@ AGENT SELF-EVALUATION REPORT - Untested ("I haven't tested the timeout behavior") - Speculation without evidence ("those are probably fine") → Wrong library used. Project uses httpx, not urllib3. - urllib3.util.Retry is incompatible with httpx transport. + urllib3.util.Retry is incompatible with httpx. Completeness ███░░ 3/5 - Explicit gap acknowledged ("might be edge cases with POST") @@ -70,7 +70,7 @@ AGENT SELF-EVALUATION REPORT OVERALL 2.8/5 TOP IMPROVEMENTS (axes scoring < 4): - [Accuracy] Switch to httpx.Retry — grep the codebase to confirm the HTTP + [Accuracy] Switch to httpx — grep the codebase to confirm the HTTP library before writing code. [Actionability] Create a PR with the changed file + test file. Run the tests. End with "PR #N ready to merge." diff --git a/skills/agent-self-evaluation/references/evaluation-criteria.md b/skills/agent-self-evaluation/references/evaluation-criteria.md index faf83e7d..9a352bf1 100644 --- a/skills/agent-self-evaluation/references/evaluation-criteria.md +++ b/skills/agent-self-evaluation/references/evaluation-criteria.md @@ -6,8 +6,8 @@ This reference provides concrete scoring anchors for each axis. Use it when you' | Score | Anchor | Example | |---|---|---| -| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Used `httpx.Retry` — confirmed in httpx docs. All method names verified with grep against codebase. | -| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (httpx defaults to 1.0s, claimed 0.5s). | +| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Configured retry via httpx transport — confirmed in httpx docs. All method names verified with grep against codebase. | +| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (claimed 0.5s, docs say 1.0s). 
| | 3 | One significant factual error, or 3+ minor inaccuracies. | Used `urllib3.Retry` in an httpx codebase. Works in this one case but wrong library. | | 2 | Multiple significant errors. Output would fail if followed. | Claimed "add this to package.json" but project uses pyproject.toml. Two other config claims also wrong. | | 1 | Fundamentally incorrect. Output contradicts itself or known facts. | Code has syntax errors. API endpoint doesn't exist. Claims a function signature that grep disproves. | diff --git a/skills/agent-self-evaluation/references/hook-integration.md b/skills/agent-self-evaluation/references/hook-integration.md index 260de2ca..066556f0 100644 --- a/skills/agent-self-evaluation/references/hook-integration.md +++ b/skills/agent-self-evaluation/references/hook-integration.md @@ -1,6 +1,6 @@ # Hook Integration for Session-Stop Self-Evaluation -Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session: +Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session (the hook echoes a reminder; it does not run the evaluator automatically): ```json { diff --git a/skills/agent-self-evaluation/scripts/evaluate.py b/skills/agent-self-evaluation/scripts/evaluate.py index 566242a1..f560dc98 100755 --- a/skills/agent-self-evaluation/scripts/evaluate.py +++ b/skills/agent-self-evaluation/scripts/evaluate.py @@ -144,7 +144,7 @@ def _check_jargon(text: str) -> tuple[int, list[str]]: def _check_summary(text: str) -> tuple[int, list[str]]: """Return clarity deduction when long output lacks an early summary.""" summary_terms = ["summary", "tldr", "overview", "in short"] - has_early_summary = any(term in text[:100].lower() for term in summary_terms) + has_early_summary = any(term in ' '.join(text.split()[:100]).lower() for term in summary_terms) if not has_early_summary and count_words(text) > 300: return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"] return 0, [] 
@@ -354,7 +354,7 @@ def format_report(scores: list[AxisScore]) -> str: return "\n".join(lines) -def _read_file_or_text(path: Optional[str], required: bool = False) -> Optional[str]: +def _read_file_or_text(path: Optional[str], *, required: bool = False) -> Optional[str]: """Read a file path or return inline text when allowed.""" if path is None: return None @@ -379,7 +379,7 @@ def _read_input(args: argparse.Namespace) -> tuple[Optional[str], str]: return _read_file_or_text(args.task), sys.stdin.read() -def main(): +def main() -> None: parser = argparse.ArgumentParser( description="Evaluate agent output against the 5-axis rubric" ) From 08f66b49095feb034de180576ed9b7aa03ea1537 Mon Sep 17 00:00:00 2001 From: Hawthorn Date: Wed, 10 Jun 2026 18:18:58 +0530 Subject: [PATCH 06/10] fix(agents): add Bash tool guardrails to agent-evaluator List allowed read-only commands (grep, cat, ls, find, head, tail, wc, stat, git log/diff/show) and explicitly forbid destructive commands (rm, mv, chmod, git push, git commit, sudo, pip/npm install, curl|wget piping to sh). Any write/delete/remote-push requires explicit user confirmation. --- agents/agent-evaluator.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md index 3a22ee93..b827bf44 100644 --- a/agents/agent-evaluator.md +++ b/agents/agent-evaluator.md @@ -20,6 +20,10 @@ You are a quality evaluator for AI agent output. Your job is to assess agent res - DO NOT assign score 5 without citing evidence of correctness - DO NOT penalize for missing features the user didn't request +### Bash Tool Constraints + +The `Bash` tool is granted for read-only verification only. Allowed: `grep`, `cat`, `ls`, `find`, `head`, `tail`, `wc`, `stat`, `git log`, `git diff`, `git show`. 
Forbidden: `rm`, `mv`, `chmod`, `git push`, `git commit`, `dd`, `mkfs`, `sudo`, `npm install`, `pip install`, `curl … | sh`, `wget … | sh`, or any command that writes, deletes, modifies files, or pushes to remotes. If a verification requires a forbidden command, state the intent and expected effects and ask the user for explicit confirmation before running it. + ## Workflow ### Step 1: Understand the Task From f65ab491be3b748502566e179c63235bffe878da Mon Sep 17 00:00:00 2001 From: Hawthorn Date: Wed, 10 Jun 2026 18:21:12 +0530 Subject: [PATCH 07/10] fix(docs): clarify Stop event matcher is optional, not disallowed MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Validator (scripts/ci/validate-hooks.js line 182-184) only errors when matcher is missing for non-EVENTS_WITHOUT_MATCHER events. For Stop (in EVENTS_WITHOUT_MATCHER), matcher is optional — presence is allowed and validated for type correctness, absence is also accepted. --- skills/agent-self-evaluation/references/hook-integration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/skills/agent-self-evaluation/references/hook-integration.md b/skills/agent-self-evaluation/references/hook-integration.md index 066556f0..e56455f3 100644 --- a/skills/agent-self-evaluation/references/hook-integration.md +++ b/skills/agent-self-evaluation/references/hook-integration.md @@ -20,7 +20,7 @@ Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the } ``` -`Stop` events do not use a `matcher` field. Keep the hook object limited to `hooks` and metadata such as `description`. +`Stop` events do not require a `matcher` field (it is optional for `Stop`, `Notification`, `UserPromptSubmit`, and `SubagentStop` per `scripts/ci/validate-hooks.js`). If omitted, the hook object only needs `hooks` and metadata such as `description`. 
## Integration with the Python Evaluator From 8d360fb46642f406fe95f67a2b3697fcf0238bed Mon Sep 17 00:00:00 2001 From: Hawthorn Date: Wed, 10 Jun 2026 18:27:27 +0530 Subject: [PATCH 08/10] fix: address remaining review nits MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add top-level hooks wrapper to second JSON example (consistent with hooks.json format) - Extract hardcoded thresholds as module-level constants (WALL_OF_TEXT_WORDS, SUMMARY_CHECK_WORDS, SUMMARY_CHECK_FIRST_N, TASK_OUTPUT_RATIO_HIGH/MEDIUM) Skipped (not applicable): - 'Scoring defaults to 5/5' — by design for heuristic fallback; SKILL.md already documents pairing with LLM judge for production use - '--output silently ignored' — already fixed by _read_input refactor (checks args.output directly, not elif args.task and args.output) --- .../references/hook-integration.md | 26 ++++++++++--------- .../agent-self-evaluation/scripts/evaluate.py | 17 ++++++++---- 2 files changed, 26 insertions(+), 17 deletions(-) diff --git a/skills/agent-self-evaluation/references/hook-integration.md b/skills/agent-self-evaluation/references/hook-integration.md index e56455f3..2bb3c3ed 100644 --- a/skills/agent-self-evaluation/references/hook-integration.md +++ b/skills/agent-self-evaluation/references/hook-integration.md @@ -38,18 +38,20 @@ To integrate it into hooks, capture the last agent output to a file first, then ```json { - "PostToolUse": [ - { - "matcher": "Bash", - "hooks": [ - { - "type": "command", - "command": "echo '[Self-Eval] If this command completed verification for a non-trivial task, consider running agent-self-evaluation.'" - } - ], - "description": "Remind agent to self-evaluate after shell verification" - } - ] + "hooks": { + "PostToolUse": [ + { + "matcher": "Bash", + "hooks": [ + { + "type": "command", + "command": "echo '[Self-Eval] If this command completed verification for a non-trivial task, consider running agent-self-evaluation.'" + } + ], + 
"description": "Remind agent to self-evaluate after shell verification" + } + ] + } } ``` diff --git a/skills/agent-self-evaluation/scripts/evaluate.py b/skills/agent-self-evaluation/scripts/evaluate.py index f560dc98..2d129c40 100755 --- a/skills/agent-self-evaluation/scripts/evaluate.py +++ b/skills/agent-self-evaluation/scripts/evaluate.py @@ -24,6 +24,13 @@ import sys from dataclasses import dataclass, field from typing import Optional +# Tunable thresholds for evaluation heuristics +WALL_OF_TEXT_WORDS = 200 +SUMMARY_CHECK_WORDS = 300 +SUMMARY_CHECK_FIRST_N = 100 +TASK_OUTPUT_RATIO_HIGH = 15 +TASK_OUTPUT_RATIO_MEDIUM = 8 + @dataclass class AxisScore: @@ -144,8 +151,8 @@ def _check_jargon(text: str) -> tuple[int, list[str]]: def _check_summary(text: str) -> tuple[int, list[str]]: """Return clarity deduction when long output lacks an early summary.""" summary_terms = ["summary", "tldr", "overview", "in short"] - has_early_summary = any(term in ' '.join(text.split()[:100]).lower() for term in summary_terms) - if not has_early_summary and count_words(text) > 300: + has_early_summary = any(term in ' '.join(text.split()[:SUMMARY_CHECK_FIRST_N]).lower() for term in summary_terms) + if not has_early_summary and count_words(text) > SUMMARY_CHECK_WORDS: return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"] return 0, [] @@ -163,7 +170,7 @@ def check_clarity(text: str) -> AxisScore: evidence.append("+ Uses bullet points") for paragraph in [p for p in text.split("\n\n") if p.strip()]: - if count_words(paragraph) > 200: + if count_words(paragraph) > WALL_OF_TEXT_WORDS: deductions += 1 evidence.append("- Wall-of-text paragraph (>200 words without break)") break @@ -245,10 +252,10 @@ def check_conciseness(text: str, task: Optional[str] = None) -> AxisScore: if task: task_wc = count_words(task) ratio = wc / max(task_wc, 1) - if ratio > 15: + if ratio > TASK_OUTPUT_RATIO_HIGH: evidence.append(f"- Output is {ratio:.0f}x longer than task description (high 
ratio)") score = min(score, 3) - elif ratio > 8: + elif ratio > TASK_OUTPUT_RATIO_MEDIUM: evidence.append(f"- Output is {ratio:.0f}x longer than task description") score = min(score, 4) From 1e679bcb4775c42d6f59e3727176180d04040484 Mon Sep 17 00:00:00 2001 From: Hawthorn Date: Wed, 10 Jun 2026 18:30:22 +0530 Subject: [PATCH 09/10] fix(agents): harden git commands against pager-based code execution Git commands (log, diff, show) can execute arbitrary code via: - core.pager set in repo-local .git/config - diff.external pointing to an attacker-controlled binary - filter drivers in .gitattributes Mitigation: require --no-pager flag, recommend -c core.pager=cat to disable pager-driven execution. Moved git commands from the unqualified allowlist to a hardened allowlist with explicit flags. --- agents/agent-evaluator.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md index b827bf44..04317118 100644 --- a/agents/agent-evaluator.md +++ b/agents/agent-evaluator.md @@ -22,7 +22,7 @@ You are a quality evaluator for AI agent output. Your job is to assess agent res ### Bash Tool Constraints -The `Bash` tool is granted for read-only verification only. Allowed: `grep`, `cat`, `ls`, `find`, `head`, `tail`, `wc`, `stat`, `git log`, `git diff`, `git show`. Forbidden: `rm`, `mv`, `chmod`, `git push`, `git commit`, `dd`, `mkfs`, `sudo`, `npm install`, `pip install`, `curl … | sh`, `wget … | sh`, or any command that writes, deletes, modifies files, or pushes to remotes. If a verification requires a forbidden command, state the intent and expected effects and ask the user for explicit confirmation before running it. +The `Bash` tool is granted for read-only verification only. Allowed: `grep`, `cat`, `ls`, `find`, `head`, `tail`, `wc`, `stat`. 
Allowed with hardening: `git --no-pager log`, `git --no-pager diff`, `git --no-pager show` (always pass `--no-pager` before the subcommand, since it is an option of `git` itself and not of `log`/`diff`/`show`; prefer `-c core.pager=cat` to disable pager-driven code execution via repo-local `.git/config`). Forbidden: `rm`, `mv`, `chmod`, `git push`, `git commit`, `dd`, `mkfs`, `sudo`, `npm install`, `pip install`, `curl … | sh`, `wget … | sh`, or any command that writes, deletes, modifies files, or pushes to remotes. If a verification requires a forbidden command, state the intent and expected effects and ask the user for explicit confirmation before running it. ## Workflow From 149be89d397ef448acd243f000e3c004eb67c48f Mon Sep 17 00:00:00 2001 From: Hawthorn Date: Thu, 11 Jun 2026 17:58:57 +0530 Subject: [PATCH 10/10] fix: address final lint blockers for agent self-evaluation - Replace U+274C cross-mark examples with ASCII FAIL: prefixes - Ensure agent-evaluator markdown ends with trailing newline - Replace markdown placeholder underscores with bracketed placeholders to satisfy markdownlint MD037 --- agents/agent-evaluator.md | 2 +- skills/agent-self-evaluation/SKILL.md | 10 +++++----- .../references/evaluation-criteria.md | 2 +- .../templates/evaluation-report.md | 2 +- 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md index 04317118..c44242ba 100644 --- a/agents/agent-evaluator.md +++ b/agents/agent-evaluator.md @@ -203,4 +203,4 @@ TOP IMPROVEMENTS: 3. [Completeness] Handle 429, connection errors, and timeout VERDICT: Redo with specific fixes. Weakest axis: Accuracy (2/5). -``` \ No newline at end of file +``` diff --git a/skills/agent-self-evaluation/SKILL.md b/skills/agent-self-evaluation/SKILL.md index 0e1a2fd6..4e241380 100644 --- a/skills/agent-self-evaluation/SKILL.md +++ b/skills/agent-self-evaluation/SKILL.md @@ -85,7 +85,7 @@ If any axis scored 3 or below: 1. State what you would do differently 2.
If the gap is fixable in < 30 seconds (missing link, unclear phrasing), fix it now -3. If the gap requires rework, flag it explicitly: "This axis scored __ because __. Re-running with __ would likely raise it to __." +3. If the gap requires rework, flag it explicitly: "This axis scored [score] because [evidence]. Re-running with [specific fix] would likely raise it to [target score]." ## Code Examples @@ -134,7 +134,7 @@ Overall: 2.8 — Wrong library used. Needs httpx rewrite. ### "Everything is a 5" ``` -❌ Accuracy: 5 — All good. +FAIL: Accuracy: 5 — All good. Completeness: 5 — Everything covered. Clarity: 5 — Clear. ``` @@ -144,7 +144,7 @@ No evidence cited. This is self-congratulation, not evaluation. A real 5 require ### Over-penalizing for scope creep ``` -❌ Completeness: 2 — Didn't handle WebSocket connections or +FAIL: Completeness: 2 — Didn't handle WebSocket connections or gRPC streaming (user didn't ask for these) ``` @@ -153,7 +153,7 @@ Only evaluate against what the user actually requested, not what you could have ### Using the evaluation to re-litigate ``` -❌ "As I said earlier, this approach is wrong. Score: 1" +FAIL: "As I said earlier, this approach is wrong. Score: 1" ``` The evaluation is about the delivered output, not about re-arguing design decisions that were already made. If the approach was wrong, that should have been caught before delivery. @@ -161,7 +161,7 @@ The evaluation is about the delivered output, not about re-arguing design decisi ### Mixing personal preference with objective gaps ``` -❌ "Score: 3. I don't like Python decorators." +FAIL: "Score: 3. I don't like Python decorators." ``` "Don't like" is not evidence. Cite a concrete readability, testability, or correctness concern, or leave the score at 4+.
diff --git a/skills/agent-self-evaluation/references/evaluation-criteria.md b/skills/agent-self-evaluation/references/evaluation-criteria.md index 9a352bf1..fbb3cf90 100644 --- a/skills/agent-self-evaluation/references/evaluation-criteria.md +++ b/skills/agent-self-evaluation/references/evaluation-criteria.md @@ -56,7 +56,7 @@ This reference provides concrete scoring anchors for each axis. Use it when you' ### When the user gave unclear instructions -If the user's request was ambiguous, do NOT penalize completeness for not reading minds. Instead, note in the evaluation: "User's request was ambiguous about __. I chose interpretation __. If they meant __, this score would drop to __." +If the user's request was ambiguous, do NOT penalize completeness for not reading minds. Instead, note in the evaluation: "User's request was ambiguous about [scope]. I chose interpretation [chosen interpretation]. If they meant [alternative interpretation], this score would drop to [score]." ### When the task is inherently simple diff --git a/skills/agent-self-evaluation/templates/evaluation-report.md b/skills/agent-self-evaluation/templates/evaluation-report.md index 46737092..bbc06d4b 100644 --- a/skills/agent-self-evaluation/templates/evaluation-report.md +++ b/skills/agent-self-evaluation/templates/evaluation-report.md @@ -83,4 +83,4 @@ Skip the evaluation if: | ≥4.5 | Deliver as-is. No changes needed. | | 3.5–4.4 | Flag top improvement but deliver. Fix if <30 seconds. | | 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" | -| <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. | +| <2.5 | Don't deliver. Say: "This scored [score] because [evidence]. Let me redo this with [specific fix]." Then redo. |
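
Reviewer sketch (not part of the patch): the delivery thresholds in the table above could be expressed as a small helper, for example if `scripts/evaluate.py` were later extended to print a verdict. The function name `delivery_action` and the returned labels are hypothetical, and the sketch assumes the overall score is rounded to one decimal place before comparison, as the `X.X/5` format in the report suggests.

```python
def delivery_action(overall: float) -> str:
    """Map an overall score (1.0-5.0) to the action in the thresholds table.

    Hypothetical helper: names and labels are illustrative only.
    """
    # Assumption: compare on the one-decimal value shown in the report.
    overall = round(overall, 1)
    if overall >= 4.5:
        return "deliver"            # >=4.5: deliver as-is
    if overall >= 3.5:
        return "deliver-with-flag"  # 3.5-4.4: flag top improvement, deliver
    if overall >= 2.5:
        return "ask-user"           # 2.5-3.4: ask whether to redo or deliver
    return "redo"                   # <2.5: do not deliver, redo with a fix


if __name__ == "__main__":
    for score in (4.6, 4.2, 2.8, 2.0):
        print(score, delivery_action(score))
```

Under these assumptions, the low-score example in this series (overall 2.8) would fall in the "ask the user" band and not the "redo" band.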