fix: address second-round review comments

- Replace httpx.Retry references with correct httpx API usage across all files
  (httpx has no built-in Retry class; use HTTPTransport/Limits instead)
- Fix _check_summary to check first 100 words (not 100 characters)
- Fix template to only show → improvement arrow for non-5 scores
- Clarify hook documentation: hook echoes reminder, does not run evaluator
- Add return type annotation to main()
- Make required parameter keyword-only in _read_file_or_text
This commit is contained in:
Hawthorn 2026-06-10 17:59:25 +05:30
parent 2ea4d779a3
commit 7c0a0049a8
7 changed files with 23 additions and 27 deletions

View File

@ -63,24 +63,20 @@ AGENT SELF-EVALUATION REPORT
Summary: Overall score X.X/5 across 5 quality axes.
Accuracy █████ 5/5
+ [Evidence: passing tests, verified claims]
→ [Improvement if score < 5]
+ [Evidence: passing tests, verified claims] (no → when score = 5)
Completeness █████ 5/5
Completeness ████░ 4/5
+ [What's covered]
→ [Improvement if score < 5]
→ [Improvement: only shown when score < 5]
Clarity █████ 5/5
+ [Structure signals]
→ [Improvement if score < 5]
+ [Structure signals] (no → when score = 5)
Actionability █████ 5/5
+ [User can act immediately]
→ [Improvement if score < 5]
+ [User can act immediately] (no → when score = 5)
Conciseness █████ 5/5
+ [Information density]
→ [Improvement if score < 5]
+ [Information density] (no → when score = 5)
OVERALL X.X/5
@ -115,7 +111,7 @@ Summary: Overall score X.X/5 across 5 quality axes.
Accuracy █████ 5/5
+ Tests passing
+ grep confirms httpx.Retry used correctly
+ grep confirms httpx transport configured correctly
+ Import verified
Completeness ████░ 4/5
@ -192,13 +188,13 @@ Summary: Overall score X.X/5 across 5 quality axes.
OVERALL 2.8/5
CRITICAL ISSUES (axes ≤ 2):
[Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry.
[Accuracy] Score 2/5 — Wrong library. Use httpx, not urllib3.
[Actionability] Score 2/5 — No deliverable. Create a PR with test file.
Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable.
TOP IMPROVEMENTS:
1. [Accuracy] Switch to httpx.Retry — grep the codebase first
1. [Accuracy] Switch to httpx — grep the codebase first
2. [Actionability] Create a PR with src/api_client.py + tests
3. [Completeness] Handle 429, connection errors, and timeout

View File

@ -114,7 +114,7 @@ Overall: 4.6 — One gap (timeout handling). Fix before merging.
Task: Add retry logic to HTTP client
Scorecard:
Accuracy: 2 — Used urllib3.Retry which doesn't exist in our
Accuracy: 2 — Used urllib3 which doesn't match our
httpx-based codebase. Wrong library.
Completeness: 3 — Works for GET. POST/PUT not handled (user
said "all HTTP requests").
@ -125,7 +125,7 @@ Scorecard:
3 places instead of one shared RetryConfig object.
Overall: 2.8 — Wrong library used. Needs httpx rewrite.
Fix accuracy first (switch to httpx.Retry), then extend to all
Fix accuracy first (switch to httpx), then extend to all
HTTP methods, then consolidate config.
```
@ -171,7 +171,7 @@ The evaluation is about the delivered output, not about re-arguing design decisi
- **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took.
- **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap.
- **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling."
- **Be specific about what 'fixed' looks like.** "Re-run with httpx.Retry( total=3, backoff_factor=1.0 )" beats "fix the library issue."
- **Be specific about what 'fixed' looks like.** "Re-run with httpx transport configured for retries" beats "fix the library issue."
- **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof.
- **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?"

View File

@ -6,10 +6,10 @@
## Agent Output
### Summary
Added `httpx.Retry` transport to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
Added `httpx.HTTPTransport` with retry configuration to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
### Files Changed
- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.Retry( total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504] )`
- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.HTTPTransport(retries=3)` and exponential backoff configured via `httpx.Limits`
- `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429)
### Verification

View File

@ -7,7 +7,7 @@
Here's a retry implementation. I think this should work for most cases.
I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically so we just need to configure it.
I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically.
```python
from urllib3.util import Retry
@ -38,7 +38,7 @@ AGENT SELF-EVALUATION REPORT
- Untested ("I haven't tested the timeout behavior")
- Speculation without evidence ("those are probably fine")
→ Wrong library used. Project uses httpx, not urllib3.
urllib3.util.Retry is incompatible with httpx transport.
urllib3.util.Retry is incompatible with httpx.
Completeness ███░░ 3/5
- Explicit gap acknowledged ("might be edge cases with POST")
@ -70,7 +70,7 @@ AGENT SELF-EVALUATION REPORT
OVERALL 2.8/5
TOP IMPROVEMENTS (axes scoring < 4):
[Accuracy] Switch to httpx.Retry — grep the codebase to confirm the HTTP
[Accuracy] Switch to httpx — grep the codebase to confirm the HTTP
library before writing code.
[Actionability] Create a PR with the changed file + test file. Run the
tests. End with "PR #N ready to merge."

View File

@ -6,8 +6,8 @@ This reference provides concrete scoring anchors for each axis. Use it when you'
| Score | Anchor | Example |
|---|---|---|
| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Used `httpx.Retry` — confirmed in httpx docs. All method names verified with grep against codebase. |
| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (httpx defaults to 1.0s, claimed 0.5s). |
| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Configured retry via httpx transport — confirmed in httpx docs. All method names verified with grep against codebase. |
| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (claimed 0.5s, docs say 1.0s). |
| 3 | One significant factual error, or 3+ minor inaccuracies. | Used `urllib3.Retry` in an httpx codebase. Works in this one case but wrong library. |
| 2 | Multiple significant errors. Output would fail if followed. | Claimed "add this to package.json" but project uses pyproject.toml. Two other config claims also wrong. |
| 1 | Fundamentally incorrect. Output contradicts itself or known facts. | Code has syntax errors. API endpoint doesn't exist. Claims a function signature that grep disproves. |

View File

@ -1,6 +1,6 @@
# Hook Integration for Session-Stop Self-Evaluation
Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session:
Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session (the hook echoes a reminder; it does not run the evaluator automatically):
```json
{

View File

@ -144,7 +144,7 @@ def _check_jargon(text: str) -> tuple[int, list[str]]:
def _check_summary(text: str) -> tuple[int, list[str]]:
"""Return clarity deduction when long output lacks an early summary."""
summary_terms = ["summary", "tldr", "overview", "in short"]
has_early_summary = any(term in text[:100].lower() for term in summary_terms)
has_early_summary = any(term in ' '.join(text.split()[:100]).lower() for term in summary_terms)
if not has_early_summary and count_words(text) > 300:
return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"]
return 0, []
@ -354,7 +354,7 @@ def format_report(scores: list[AxisScore]) -> str:
return "\n".join(lines)
def _read_file_or_text(path: Optional[str], required: bool = False) -> Optional[str]:
def _read_file_or_text(path: Optional[str], *, required: bool = False) -> Optional[str]:
"""Read a file path or return inline text when allowed."""
if path is None:
return None
@ -379,7 +379,7 @@ def _read_input(args: argparse.Namespace) -> tuple[Optional[str], str]:
return _read_file_or_text(args.task), sys.stdin.read()
def main():
def main() -> None:
parser = argparse.ArgumentParser(
description="Evaluate agent output against the 5-axis rubric"
)