fix: address second-round review comments

- Replace httpx.Retry references with correct httpx API usage across all files
  (httpx has no built-in Retry class; use HTTPTransport/Limits instead)
- Fix _check_summary to check first 100 words (not 100 characters)
- Fix template to only show → improvement arrow for non-5 scores
- Clarify hook documentation: hook echoes reminder, does not run evaluator
- Add return type annotation to main()
- Make required parameter keyword-only in _read_file_or_text
This commit is contained in:
Hawthorn 2026-06-10 17:59:25 +05:30
parent 2ea4d779a3
commit 7c0a0049a8
7 changed files with 23 additions and 27 deletions

View File

@ -63,24 +63,20 @@ AGENT SELF-EVALUATION REPORT
Summary: Overall score X.X/5 across 5 quality axes. Summary: Overall score X.X/5 across 5 quality axes.
Accuracy █████ 5/5 Accuracy █████ 5/5
+ [Evidence: passing tests, verified claims] + [Evidence: passing tests, verified claims] (no → when score = 5)
→ [Improvement if score < 5]
Completeness █████ 5/5 Completeness ████░ 4/5
+ [What's covered] + [What's covered]
→ [Improvement if score < 5] → [Improvement: only shown when score < 5]
Clarity █████ 5/5 Clarity █████ 5/5
+ [Structure signals] + [Structure signals] (no → when score = 5)
→ [Improvement if score < 5]
Actionability █████ 5/5 Actionability █████ 5/5
+ [User can act immediately] + [User can act immediately] (no → when score = 5)
→ [Improvement if score < 5]
Conciseness █████ 5/5 Conciseness █████ 5/5
+ [Information density] + [Information density] (no → when score = 5)
→ [Improvement if score < 5]
OVERALL X.X/5 OVERALL X.X/5
@ -115,7 +111,7 @@ Summary: Overall score X.X/5 across 5 quality axes.
Accuracy █████ 5/5 Accuracy █████ 5/5
+ Tests passing + Tests passing
+ grep confirms httpx.Retry used correctly + grep confirms httpx transport configured correctly
+ Import verified + Import verified
Completeness ████░ 4/5 Completeness ████░ 4/5
@ -192,13 +188,13 @@ Summary: Overall score X.X/5 across 5 quality axes.
OVERALL 2.8/5 OVERALL 2.8/5
CRITICAL ISSUES (axes ≤ 2): CRITICAL ISSUES (axes ≤ 2):
[Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry. [Accuracy] Score 2/5 — Wrong library. Use httpx, not urllib3.
[Actionability] Score 2/5 — No deliverable. Create a PR with test file. [Actionability] Score 2/5 — No deliverable. Create a PR with test file.
Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable. Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable.
TOP IMPROVEMENTS: TOP IMPROVEMENTS:
1. [Accuracy] Switch to httpx.Retry — grep the codebase first 1. [Accuracy] Switch to httpx — grep the codebase first
2. [Actionability] Create a PR with src/api_client.py + tests 2. [Actionability] Create a PR with src/api_client.py + tests
3. [Completeness] Handle 429, connection errors, and timeout 3. [Completeness] Handle 429, connection errors, and timeout

View File

@ -114,7 +114,7 @@ Overall: 4.6 — One gap (timeout handling). Fix before merging.
Task: Add retry logic to HTTP client Task: Add retry logic to HTTP client
Scorecard: Scorecard:
Accuracy: 2 — Used urllib3.Retry which doesn't exist in our Accuracy: 2 — Used urllib3 which doesn't match our
httpx-based codebase. Wrong library. httpx-based codebase. Wrong library.
Completeness: 3 — Works for GET. POST/PUT not handled (user Completeness: 3 — Works for GET. POST/PUT not handled (user
said "all HTTP requests"). said "all HTTP requests").
@ -125,7 +125,7 @@ Scorecard:
3 places instead of one shared RetryConfig object. 3 places instead of one shared RetryConfig object.
Overall: 2.8 — Wrong library used. Needs httpx rewrite. Overall: 2.8 — Wrong library used. Needs httpx rewrite.
Fix accuracy first (switch to httpx.Retry), then extend to all Fix accuracy first (switch to httpx), then extend to all
HTTP methods, then consolidate config. HTTP methods, then consolidate config.
``` ```
@ -171,7 +171,7 @@ The evaluation is about the delivered output, not about re-arguing design decisi
- **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took. - **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took.
- **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap. - **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap.
- **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling." - **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling."
- **Be specific about what 'fixed' looks like.** "Re-run with httpx.Retry( total=3, backoff_factor=1.0 )" beats "fix the library issue." - **Be specific about what 'fixed' looks like.** "Re-run with httpx transport configured for retries" beats "fix the library issue."
- **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof. - **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof.
- **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?" - **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?"

View File

@ -6,10 +6,10 @@
## Agent Output ## Agent Output
### Summary ### Summary
Added `httpx.Retry` transport to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge. Added `httpx.HTTPTransport` with retry configuration to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
### Files Changed ### Files Changed
- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.Retry( total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504] )` - `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.HTTPTransport(retries=3)` and exponential backoff configured via `httpx.Limits`
- `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429) - `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429)
### Verification ### Verification

View File

@ -7,7 +7,7 @@
Here's a retry implementation. I think this should work for most cases. Here's a retry implementation. I think this should work for most cases.
I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically so we just need to configure it. I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically.
```python ```python
from urllib3.util import Retry from urllib3.util import Retry
@ -38,7 +38,7 @@ AGENT SELF-EVALUATION REPORT
- Untested ("I haven't tested the timeout behavior") - Untested ("I haven't tested the timeout behavior")
- Speculation without evidence ("those are probably fine") - Speculation without evidence ("those are probably fine")
→ Wrong library used. Project uses httpx, not urllib3. → Wrong library used. Project uses httpx, not urllib3.
urllib3.util.Retry is incompatible with httpx transport. urllib3.util.Retry is incompatible with httpx.
Completeness ███░░ 3/5 Completeness ███░░ 3/5
- Explicit gap acknowledged ("might be edge cases with POST") - Explicit gap acknowledged ("might be edge cases with POST")
@ -70,7 +70,7 @@ AGENT SELF-EVALUATION REPORT
OVERALL 2.8/5 OVERALL 2.8/5
TOP IMPROVEMENTS (axes scoring < 4): TOP IMPROVEMENTS (axes scoring < 4):
[Accuracy] Switch to httpx.Retry — grep the codebase to confirm the HTTP [Accuracy] Switch to httpx — grep the codebase to confirm the HTTP
library before writing code. library before writing code.
[Actionability] Create a PR with the changed file + test file. Run the [Actionability] Create a PR with the changed file + test file. Run the
tests. End with "PR #N ready to merge." tests. End with "PR #N ready to merge."

View File

@ -6,8 +6,8 @@ This reference provides concrete scoring anchors for each axis. Use it when you'
| Score | Anchor | Example | | Score | Anchor | Example |
|---|---|---| |---|---|---|
| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Used `httpx.Retry` — confirmed in httpx docs. All method names verified with grep against codebase. | | 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Configured retry via httpx transport — confirmed in httpx docs. All method names verified with grep against codebase. |
| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (httpx defaults to 1.0s, claimed 0.5s). | | 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (claimed 0.5s, docs say 1.0s). |
| 3 | One significant factual error, or 3+ minor inaccuracies. | Used `urllib3.Retry` in an httpx codebase. Works in this one case but wrong library. | | 3 | One significant factual error, or 3+ minor inaccuracies. | Used `urllib3.Retry` in an httpx codebase. Works in this one case but wrong library. |
| 2 | Multiple significant errors. Output would fail if followed. | Claimed "add this to package.json" but project uses pyproject.toml. Two other config claims also wrong. | | 2 | Multiple significant errors. Output would fail if followed. | Claimed "add this to package.json" but project uses pyproject.toml. Two other config claims also wrong. |
| 1 | Fundamentally incorrect. Output contradicts itself or known facts. | Code has syntax errors. API endpoint doesn't exist. Claims a function signature that grep disproves. | | 1 | Fundamentally incorrect. Output contradicts itself or known facts. | Code has syntax errors. API endpoint doesn't exist. Claims a function signature that grep disproves. |

View File

@ -1,6 +1,6 @@
# Hook Integration for Session-Stop Self-Evaluation # Hook Integration for Session-Stop Self-Evaluation
Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session: Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session (the hook echoes a reminder; it does not run the evaluator automatically):
```json ```json
{ {

View File

@ -144,7 +144,7 @@ def _check_jargon(text: str) -> tuple[int, list[str]]:
def _check_summary(text: str) -> tuple[int, list[str]]: def _check_summary(text: str) -> tuple[int, list[str]]:
"""Return clarity deduction when long output lacks an early summary.""" """Return clarity deduction when long output lacks an early summary."""
summary_terms = ["summary", "tldr", "overview", "in short"] summary_terms = ["summary", "tldr", "overview", "in short"]
has_early_summary = any(term in text[:100].lower() for term in summary_terms) has_early_summary = any(term in ' '.join(text.split()[:100]).lower() for term in summary_terms)
if not has_early_summary and count_words(text) > 300: if not has_early_summary and count_words(text) > 300:
return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"] return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"]
return 0, [] return 0, []
@ -354,7 +354,7 @@ def format_report(scores: list[AxisScore]) -> str:
return "\n".join(lines) return "\n".join(lines)
def _read_file_or_text(path: Optional[str], required: bool = False) -> Optional[str]: def _read_file_or_text(path: Optional[str], *, required: bool = False) -> Optional[str]:
"""Read a file path or return inline text when allowed.""" """Read a file path or return inline text when allowed."""
if path is None: if path is None:
return None return None
@ -379,7 +379,7 @@ def _read_input(args: argparse.Namespace) -> tuple[Optional[str], str]:
return _read_file_or_text(args.task), sys.stdin.read() return _read_file_or_text(args.task), sys.stdin.read()
def main(): def main() -> None:
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description="Evaluate agent output against the 5-axis rubric" description="Evaluate agent output against the 5-axis rubric"
) )