diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md
index 3169382e..3a22ee93 100644
--- a/agents/agent-evaluator.md
+++ b/agents/agent-evaluator.md
@@ -63,24 +63,20 @@ AGENT SELF-EVALUATION REPORT
 Summary: Overall score X.X/5 across 5 quality axes.
 
  Accuracy █████ 5/5
- + [Evidence: passing tests, verified claims]
- → [Improvement if score < 5]
+ + [Evidence: passing tests, verified claims] (no → when score = 5)
 
- Completeness █████ 5/5
+ Completeness ████░ 4/5
  + [What's covered]
- → [Improvement if score < 5]
+ → [Improvement: only shown when score < 5]
 
  Clarity █████ 5/5
- + [Structure signals]
- → [Improvement if score < 5]
+ + [Structure signals] (no → when score = 5)
 
  Actionability █████ 5/5
- + [User can act immediately]
- → [Improvement if score < 5]
+ + [User can act immediately] (no → when score = 5)
 
  Conciseness █████ 5/5
- + [Information density]
- → [Improvement if score < 5]
+ + [Information density] (no → when score = 5)
 
 OVERALL X.X/5
 
@@ -115,7 +111,7 @@ Summary: Overall score X.X/5 across 5 quality axes.
 
  Accuracy █████ 5/5
  + Tests passing
- + grep confirms httpx.Retry used correctly
+ + grep confirms httpx transport configured correctly
  + Import verified
 
  Completeness ████░ 4/5
@@ -192,13 +188,13 @@ Summary: Overall score X.X/5 across 5 quality axes.
 OVERALL 2.8/5
 
 CRITICAL ISSUES (axes ≤ 2):
- [Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry.
+ [Accuracy] Score 2/5 — Wrong library. Use httpx, not urllib3.
  [Actionability] Score 2/5 — No deliverable. Create a PR with test file.
 
 Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable.
 
 TOP IMPROVEMENTS:
- 1. [Accuracy] Switch to httpx.Retry — grep the codebase first
+ 1. [Accuracy] Switch to httpx — grep the codebase first
  2. [Actionability] Create a PR with src/api_client.py + tests
  3. [Completeness] Handle 429, connection errors, and timeout
 
diff --git a/skills/agent-self-evaluation/SKILL.md b/skills/agent-self-evaluation/SKILL.md
index 96edc164..0e1a2fd6 100644
--- a/skills/agent-self-evaluation/SKILL.md
+++ b/skills/agent-self-evaluation/SKILL.md
@@ -114,7 +114,7 @@ Overall: 4.6 — One gap (timeout handling). Fix before merging.
 Task: Add retry logic to HTTP client
 
 Scorecard:
- Accuracy: 2 — Used urllib3.Retry which doesn't exist in our
+ Accuracy: 2 — Used urllib3 which doesn't match our
  httpx-based codebase. Wrong library.
  Completeness: 3 — Works for GET. POST/PUT not handled
  (user said "all HTTP requests").
@@ -125,7 +125,7 @@ Scorecard:
  3 places instead of one shared RetryConfig object.
 
 Overall: 2.8 — Wrong library used. Needs httpx rewrite.
- Fix accuracy first (switch to httpx.Retry), then extend to all
+ Fix accuracy first (switch to httpx), then extend to all
  HTTP methods, then consolidate config.
 ```
 
@@ -171,7 +171,7 @@ The evaluation is about the delivered output, not about re-arguing design decisi
 - **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took.
 - **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap.
 - **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling."
-- **Be specific about what 'fixed' looks like.** "Re-run with httpx.Retry( total=3, backoff_factor=1.0 )" beats "fix the library issue."
+- **Be specific about what 'fixed' looks like.** "Re-run with httpx transport configured for retries" beats "fix the library issue."
 - **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof.
 - **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?"
 
diff --git a/skills/agent-self-evaluation/examples/high-score-example.md b/skills/agent-self-evaluation/examples/high-score-example.md
index a2da55b8..46d04587 100644
--- a/skills/agent-self-evaluation/examples/high-score-example.md
+++ b/skills/agent-self-evaluation/examples/high-score-example.md
@@ -6,10 +6,10 @@
 ## Agent Output
 
 ### Summary
-Added `httpx.Retry` transport to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
+Added `httpx.HTTPTransport` with retry configuration to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
 
 ### Files Changed
-- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.Retry( total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504] )`
+- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.HTTPTransport(retries=3)` and exponential backoff configured via `httpx.Limits`
 - `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429)
 
 ### Verification
diff --git a/skills/agent-self-evaluation/examples/low-score-example.md b/skills/agent-self-evaluation/examples/low-score-example.md
index 25047e7a..6fff99f6 100644
--- a/skills/agent-self-evaluation/examples/low-score-example.md
+++ b/skills/agent-self-evaluation/examples/low-score-example.md
@@ -7,7 +7,7 @@
 
 Here's a retry implementation. I think this should work for most cases.
 
-I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically so we just need to configure it.
+I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically.
 
 ```python
 from urllib3.util import Retry
@@ -38,7 +38,7 @@ AGENT SELF-EVALUATION REPORT
  - Untested ("I haven't tested the timeout behavior")
  - Speculation without evidence ("those are probably fine")
  → Wrong library used. Project uses httpx, not urllib3.
- urllib3.util.Retry is incompatible with httpx transport.
+ urllib3.util.Retry is incompatible with httpx.
 
  Completeness ███░░ 3/5
  - Explicit gap acknowledged ("might be edge cases with POST")
@@ -70,7 +70,7 @@ AGENT SELF-EVALUATION REPORT
 OVERALL 2.8/5
 
 TOP IMPROVEMENTS (axes scoring < 4):
- [Accuracy] Switch to httpx.Retry — grep the codebase to confirm the HTTP
+ [Accuracy] Switch to httpx — grep the codebase to confirm the HTTP
  library before writing code.
  [Actionability] Create a PR with the changed file + test file. Run the
  tests. End with "PR #N ready to merge."
diff --git a/skills/agent-self-evaluation/references/evaluation-criteria.md b/skills/agent-self-evaluation/references/evaluation-criteria.md
index faf83e7d..9a352bf1 100644
--- a/skills/agent-self-evaluation/references/evaluation-criteria.md
+++ b/skills/agent-self-evaluation/references/evaluation-criteria.md
@@ -6,8 +6,8 @@ This reference provides concrete scoring anchors for each axis. Use it when you'
 
 | Score | Anchor | Example |
 |---|---|---|
-| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Used `httpx.Retry` — confirmed in httpx docs. All method names verified with grep against codebase. |
-| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (httpx defaults to 1.0s, claimed 0.5s). |
+| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Configured retry via httpx transport — confirmed in httpx docs. All method names verified with grep against codebase. |
+| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (claimed 0.5s, docs say 1.0s). |
 | 3 | One significant factual error, or 3+ minor inaccuracies. | Used `urllib3.Retry` in an httpx codebase. Works in this one case but wrong library. |
 | 2 | Multiple significant errors. Output would fail if followed. | Claimed "add this to package.json" but project uses pyproject.toml. Two other config claims also wrong. |
 | 1 | Fundamentally incorrect. Output contradicts itself or known facts. | Code has syntax errors. API endpoint doesn't exist. Claims a function signature that grep disproves. |
diff --git a/skills/agent-self-evaluation/references/hook-integration.md b/skills/agent-self-evaluation/references/hook-integration.md
index 260de2ca..066556f0 100644
--- a/skills/agent-self-evaluation/references/hook-integration.md
+++ b/skills/agent-self-evaluation/references/hook-integration.md
@@ -1,6 +1,6 @@
 # Hook Integration for Session-Stop Self-Evaluation
 
-Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session:
+Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session (the hook echoes a reminder; it does not run the evaluator automatically):
 
 ```json
 {
diff --git a/skills/agent-self-evaluation/scripts/evaluate.py b/skills/agent-self-evaluation/scripts/evaluate.py
index 566242a1..f560dc98 100755
--- a/skills/agent-self-evaluation/scripts/evaluate.py
+++ b/skills/agent-self-evaluation/scripts/evaluate.py
@@ -144,7 +144,7 @@ def _check_jargon(text: str) -> tuple[int, list[str]]:
 def _check_summary(text: str) -> tuple[int, list[str]]:
     """Return clarity deduction when long output lacks an early summary."""
     summary_terms = ["summary", "tldr", "overview", "in short"]
-    has_early_summary = any(term in text[:100].lower() for term in summary_terms)
+    has_early_summary = any(term in ' '.join(text.split()[:100]).lower() for term in summary_terms)
     if not has_early_summary and count_words(text) > 300:
         return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"]
     return 0, []
@@ -354,7 +354,7 @@ def format_report(scores: list[AxisScore]) -> str:
     return "\n".join(lines)
 
 
-def _read_file_or_text(path: Optional[str], required: bool = False) -> Optional[str]:
+def _read_file_or_text(path: Optional[str], *, required: bool = False) -> Optional[str]:
     """Read a file path or return inline text when allowed."""
     if path is None:
         return None
@@ -379,7 +379,7 @@ def _read_input(args: argparse.Namespace) -> tuple[Optional[str], str]:
     return _read_file_or_text(args.task), sys.stdin.read()
 
 
-def main():
+def main() -> None:
     parser = argparse.ArgumentParser(
         description="Evaluate agent output against the 5-axis rubric"
     )
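
Review note on the `_check_summary` hunk in `scripts/evaluate.py`: the old expression searched the first 100 characters, while the deduction message talks about the first 100 words. The sketch below illustrates how the two windows can disagree. It is not the script's actual code: `count_words` is assumed to be a plain whitespace split (its real definition is not part of this diff), and the two `has_early_summary_*` function names and the sample text are hypothetical.

```python
SUMMARY_TERMS = ["summary", "tldr", "overview", "in short"]


def count_words(text: str) -> int:
    # Assumption: stand-in for the real helper, which this diff does not show.
    return len(text.split())


def has_early_summary_chars(text: str) -> bool:
    # Old behaviour: look only at the first 100 characters.
    return any(term in text[:100].lower() for term in SUMMARY_TERMS)


def has_early_summary_words(text: str) -> bool:
    # New behaviour: look at the first 100 whitespace-separated words.
    head = " ".join(text.split()[:100]).lower()
    return any(term in head for term in SUMMARY_TERMS)


# Hypothetical input: the summary marker sits past character 100
# but well inside the first 100 words.
intro = "This change touches the retry path in the shared client and its tests. " * 2
sample = intro + "Summary: retries now cover all HTTP methods. " + "filler " * 300

print(has_early_summary_chars(sample))  # character window misses the marker
print(has_early_summary_words(sample))  # word window finds it
```

With the character window, a long output like this one would presumably have taken the clarity deduction even though a summary appears in its second sentence; the word window matches what the deduction message describes.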