fix: address second-round review comments

- Replace httpx.Retry references with correct httpx API usage across all files (httpx has no built-in Retry class; use HTTPTransport/Limits instead) - Fix _check_summary to check first 100 words (not 100 characters) - Fix template to only show → improvement arrow for non-5 scores - Clarify hook documentation: hook echoes reminder, does not run evaluator - Add return type annotation to main() - Make required parameter keyword-only in _read_file_or_text
2026-06-20 03:40:29 +08:00 · 2026-06-10 17:59:25 +05:30 · 2026-06-10 17:59:25 +05:30 · 7c0a0049a8
commit 7c0a0049a8
parent 2ea4d779a3
7 changed files with 23 additions and 27 deletions
--- a/agents/agent-evaluator.md
+++ b/agents/agent-evaluator.md
@ -63,24 +63,20 @@ AGENT SELF-EVALUATION REPORT
 Summary: Overall score X.X/5 across 5 quality axes.
  Accuracy         █████ 5/5
-    + [Evidence: passing tests, verified claims]
+    + [Evidence: passing tests, verified claims]  (no → when score = 5)
    → [Improvement if score < 5]
-  Completeness      █████ 5/5
+  Completeness      ████░ 4/5
    + [What's covered]
-    → [Improvement if score < 5]
+    → [Improvement: only shown when score < 5]
  Clarity           █████ 5/5
-    + [Structure signals]
+    + [Structure signals]  (no → when score = 5)
    → [Improvement if score < 5]
  Actionability     █████ 5/5
-    + [User can act immediately]
+    + [User can act immediately]  (no → when score = 5)
    → [Improvement if score < 5]
  Conciseness       █████ 5/5
-    + [Information density]
+    + [Information density]  (no → when score = 5)
    → [Improvement if score < 5]
  OVERALL           X.X/5
@ -115,7 +111,7 @@ Summary: Overall score X.X/5 across 5 quality axes.
  Accuracy         █████ 5/5
    + Tests passing
-    + grep confirms httpx.Retry used correctly
+    + grep confirms httpx transport configured correctly
    + Import verified
  Completeness      ████░ 4/5
@ -192,13 +188,13 @@ Summary: Overall score X.X/5 across 5 quality axes.
  OVERALL           2.8/5
 CRITICAL ISSUES (axes ≤ 2):
-  [Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry.
+  [Accuracy] Score 2/5 — Wrong library. Use httpx, not urllib3.
  [Actionability] Score 2/5 — No deliverable. Create a PR with test file.
 Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable.
 TOP IMPROVEMENTS:
-  1. [Accuracy] Switch to httpx.Retry — grep the codebase first
+  1. [Accuracy] Switch to httpx — grep the codebase first
  2. [Actionability] Create a PR with src/api_client.py + tests
  3. [Completeness] Handle 429, connection errors, and timeout
--- a/skills/agent-self-evaluation/SKILL.md
+++ b/skills/agent-self-evaluation/SKILL.md
@ -114,7 +114,7 @@ Overall: 4.6 — One gap (timeout handling). Fix before merging.
 Task: Add retry logic to HTTP client
 Scorecard:
-  Accuracy:    2 — Used urllib3.Retry which doesn't exist in our
+  Accuracy:    2 — Used urllib3 which doesn't match our
                  httpx-based codebase. Wrong library.
  Completeness: 3 — Works for GET. POST/PUT not handled (user
                  said "all HTTP requests").
@ -125,7 +125,7 @@ Scorecard:
                  3 places instead of one shared RetryConfig object.
 Overall: 2.8 — Wrong library used. Needs httpx rewrite.
-  Fix accuracy first (switch to httpx.Retry), then extend to all
+  Fix accuracy first (switch to httpx), then extend to all
  HTTP methods, then consolidate config.
 ```
@ -171,7 +171,7 @@ The evaluation is about the delivered output, not about re-arguing design decisi
 - **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took.
 - **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap.
 - **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling."
- **Be specific about what 'fixed' looks like.** "Re-run with httpx.Retry( total=3, backoff_factor=1.0 )" beats "fix the library issue."
+- **Be specific about what 'fixed' looks like.** "Re-run with httpx transport configured for retries" beats "fix the library issue."
 - **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof.
 - **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?"
--- a/skills/agent-self-evaluation/examples/high-score-example.md
+++ b/skills/agent-self-evaluation/examples/high-score-example.md
@ -6,10 +6,10 @@
 ## Agent Output
 ### Summary
-Added `httpx.Retry` transport to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
+Added `httpx.HTTPTransport` with retry configuration to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
 ### Files Changed
- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.Retry( total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504] )`
+- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.HTTPTransport(retries=3)` and exponential backoff configured via `httpx.Limits`
 - `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429)
 ### Verification
--- a/skills/agent-self-evaluation/examples/low-score-example.md
+++ b/skills/agent-self-evaluation/examples/low-score-example.md
@ -7,7 +7,7 @@
 Here's a retry implementation. I think this should work for most cases.
-I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically so we just need to configure it.
+I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically.
 ```python
 from urllib3.util import Retry
@ -38,7 +38,7 @@ AGENT SELF-EVALUATION REPORT
    - Untested ("I haven't tested the timeout behavior")
    - Speculation without evidence ("those are probably fine")
    → Wrong library used. Project uses httpx, not urllib3.
-      urllib3.util.Retry is incompatible with httpx transport.
+                      urllib3.util.Retry is incompatible with httpx.
  Completeness      ███░░ 3/5
    - Explicit gap acknowledged ("might be edge cases with POST")
@ -70,7 +70,7 @@ AGENT SELF-EVALUATION REPORT
  OVERALL           2.8/5
 TOP IMPROVEMENTS (axes scoring < 4):
-  [Accuracy] Switch to httpx.Retry — grep the codebase to confirm the HTTP
+  [Accuracy] Switch to httpx — grep the codebase to confirm the HTTP
    library before writing code.
  [Actionability] Create a PR with the changed file + test file. Run the
    tests. End with "PR #N ready to merge."
--- a/skills/agent-self-evaluation/references/evaluation-criteria.md
+++ b/skills/agent-self-evaluation/references/evaluation-criteria.md
@ -6,8 +6,8 @@ This reference provides concrete scoring anchors for each axis. Use it when you'
 | Score | Anchor | Example |
 |---|---|---|
-| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Used `httpx.Retry` — confirmed in httpx docs. All method names verified with grep against codebase. |
+| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Configured retry via httpx transport — confirmed in httpx docs. All method names verified with grep against codebase. |
-| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (httpx defaults to 1.0s, claimed 0.5s). |
+| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (claimed 0.5s, docs say 1.0s). |
 | 3 | One significant factual error, or 3+ minor inaccuracies. | Used `urllib3.Retry` in an httpx codebase. Works in this one case but wrong library. |
 | 2 | Multiple significant errors. Output would fail if followed. | Claimed "add this to package.json" but project uses pyproject.toml. Two other config claims also wrong. |
 | 1 | Fundamentally incorrect. Output contradicts itself or known facts. | Code has syntax errors. API endpoint doesn't exist. Claims a function signature that grep disproves. |
--- a/skills/agent-self-evaluation/references/hook-integration.md
+++ b/skills/agent-self-evaluation/references/hook-integration.md
@ -1,6 +1,6 @@
 # Hook Integration for Session-Stop Self-Evaluation
-Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session:
+Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session (the hook echoes a reminder; it does not run the evaluator automatically):
 ```json
 {
--- a/skills/agent-self-evaluation/scripts/evaluate.py
+++ b/skills/agent-self-evaluation/scripts/evaluate.py
@ -144,7 +144,7 @@ def _check_jargon(text: str) -> tuple[int, list[str]]:
 def _check_summary(text: str) -> tuple[int, list[str]]:
    """Return clarity deduction when long output lacks an early summary."""
    summary_terms = ["summary", "tldr", "overview", "in short"]
-    has_early_summary = any(term in text[:100].lower() for term in summary_terms)
+    has_early_summary = any(term in ' '.join(text.split()[:100]).lower() for term in summary_terms)
    if not has_early_summary and count_words(text) > 300:
        return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"]
    return 0, []
@ -354,7 +354,7 @@ def format_report(scores: list[AxisScore]) -> str:
    return "\n".join(lines)
-def _read_file_or_text(path: Optional[str], required: bool = False) -> Optional[str]:
+def _read_file_or_text(path: Optional[str], *, required: bool = False) -> Optional[str]:
    """Read a file path or return inline text when allowed."""
    if path is None:
        return None
@ -379,7 +379,7 @@ def _read_input(args: argparse.Namespace) -> tuple[Optional[str], str]:
    return _read_file_or_text(args.task), sys.stdin.read()
-def main():
+def main() -> None:
    parser = argparse.ArgumentParser(
        description="Evaluate agent output against the 5-axis rubric"
    )